[jira] [Commented] (LUCENE-8939) Shared Hit Count Early Termination
[ https://issues.apache.org/jira/browse/LUCENE-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929314#comment-16929314 ]

Michael McCandless commented on LUCENE-8939:

Thanks [~jpountz], this is an exciting improvement for concurrent search users!

> Shared Hit Count Early Termination
> ----------------------------------
>
> Key: LUCENE-8939
> URL: https://issues.apache.org/jira/browse/LUCENE-8939
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Atri Sharma
> Priority: Major
> Fix For: 8.3
> Time Spent: 12h 20m
> Remaining Estimate: 0h
>
> When collecting hits across sorted segments, it should be possible to
> terminate early across all slices when enough hits have been collected
> globally, i.e. hit count > numHits AND hit count >= totalHitsThreshold.

--
This message was sent by Atlassian Jira (v8.3.2#803003)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8939) Shared Hit Count Early Termination
[ https://issues.apache.org/jira/browse/LUCENE-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927570#comment-16927570 ]

Michael McCandless commented on LUCENE-8939:

Is this issue done? Will we backport to 8.x?
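[Editorial note] The global termination condition described in the issue can be sketched with a counter shared by all concurrently-searched slices. This is an illustrative, hedged sketch using only the JDK; the class and method names below are hypothetical and are not Lucene's actual implementation (which uses its own shared accumulator classes).

```java
import java.util.concurrent.atomic.LongAdder;

// Sketch of a hit counter shared across concurrently-searched slices: each
// slice adds its hits, and any slice may stop collecting once the global
// count covers both the requested top hits (numHits) and the accurate-count
// threshold (totalHitsThreshold).
public class GlobalHitCounter {
    private final LongAdder hitCount = new LongAdder();
    private final int numHits;
    private final int totalHitsThreshold;

    public GlobalHitCounter(int numHits, int totalHitsThreshold) {
        this.numHits = numHits;
        this.totalHitsThreshold = totalHitsThreshold;
    }

    /** Called by a slice for each hit it collects. */
    public void countHit() {
        hitCount.increment();
    }

    /** True once every slice may early-terminate. */
    public boolean canEarlyTerminate() {
        long count = hitCount.sum();
        return count > numHits && count >= totalHitsThreshold;
    }

    public static void main(String[] args) {
        GlobalHitCounter counter = new GlobalHitCounter(10, 100);
        for (int i = 0; i < 100; i++) {
            counter.countHit();
        }
        // 100 hits counted: > numHits (10) and >= totalHitsThreshold (100)
        System.out.println(counter.canEarlyTerminate());  // true
    }
}
```

The point of sharing the counter (rather than keeping one per slice) is that a fast slice that has already seen many hits lets slower slices terminate sooner.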
[jira] [Commented] (LUCENE-7282) search APIs should take advantage of index sort by default
[ https://issues.apache.org/jira/browse/LUCENE-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926850#comment-16926850 ]

Michael McCandless commented on LUCENE-7282:

Aha, thanks [~atris]!

> search APIs should take advantage of index sort by default
> ----------------------------------------------------------
>
> Key: LUCENE-7282
> URL: https://issues.apache.org/jira/browse/LUCENE-7282
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
>
> Spinoff from LUCENE-6766, where we made it very easy to have Lucene sort
> documents in the index (at merge time).
> An index-time sort is powerful because if you then search that index by the
> same sort (or by a "prefix" of it), you can early-terminate per segment once
> you've collected enough hits. But doing this by default would mean accepting
> an approximate hit count, and could not be used in cases that need to see
> every hit, e.g. if you are also faceting.
> Separately, `TermQuery` on the leading sort field can be very fast since we
> can advance to the first docID, and only match to the last docID for the
> requested value. This would not be approximate, and should be lower risk /
> easier.
[jira] [Commented] (LUCENE-7282) search APIs should take advantage of index sort by default
[ https://issues.apache.org/jira/browse/LUCENE-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926839#comment-16926839 ]

Michael McCandless commented on LUCENE-7282:

Do we optimize the case where an exact or range DV query clause is "congruent" with the index sort? E.g. say my index sort is a {{DocValues.NUMERIC}} field {{foobar}} and my query has a clause {{foobar=17}}; then, per segment, we can efficiently skip to the {{docid}} range for the value {{17}}, even if the user did not index dimensional points for that field.

I thought we had an issue open for this but I can't find it now ...
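[Editorial note] The optimization asked about above works because, in an index sorted by {{foobar}}, each segment's per-document values form a non-decreasing sequence over docIDs, so the matching docID range for an exact value can be found with two binary searches instead of a full scan. The sketch below is illustrative only, using a plain array to stand in for doc values; the names are not Lucene APIs.

```java
import java.util.Arrays;

// Find the [firstDoc, lastDoc] range for an exact value in a segment whose
// per-doc values are sorted (because the index sort is on this field).
public class SortedValueRange {
    /** Returns {firstDoc, lastDoc} inclusive for `target`, or null if absent. */
    public static int[] docIdRange(long[] sortedValues, long target) {
        int lo = lowerBound(sortedValues, target);
        if (lo == sortedValues.length || sortedValues[lo] != target) {
            return null;  // value does not occur in this segment
        }
        int hi = lowerBound(sortedValues, target + 1) - 1;
        return new int[] {lo, hi};
    }

    // First index whose value is >= target (classic lower-bound search).
    private static int lowerBound(long[] a, long target) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        long[] foobar = {3, 9, 17, 17, 17, 21, 40};  // doc values, index sorted by "foobar"
        System.out.println(Arrays.toString(docIdRange(foobar, 17)));  // [2, 4]
    }
}
```

A range clause works the same way: lower-bound the range's min and upper-bound its max. Either way the match is exact, not approximate, which is why the comment calls this lower risk than early termination.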
[jira] [Commented] (LUCENE-8963) Allow Collectors To "Publish" If They Can Be Used In Concurrent Search
[ https://issues.apache.org/jira/browse/LUCENE-8963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923284#comment-16923284 ]

Michael McCandless commented on LUCENE-8963:

Do we have examples of collectors in Lucene today that are single-threaded? The core collectors, at least {{TopFieldCollector}} and {{TopDocsCollector}}, seem to be OK since {{IndexSearcher}} makes a {{CollectorManager}} that uses {{TopDocs.merge}} in the end. So maybe as long as a {{CollectorManager}} is available, that implies it is thread safe?

> Allow Collectors To "Publish" If They Can Be Used In Concurrent Search
> ----------------------------------------------------------------------
>
> Key: LUCENE-8963
> URL: https://issues.apache.org/jira/browse/LUCENE-8963
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Atri Sharma
> Priority: Major
>
> There is an implied assumption today that all we need to run a query
> concurrently is a CollectorManager implementation. While that is true, there
> might be some corner cases where a Collector's semantics do not allow it to
> be concurrently executed (think of ES's aggregates). If a user manages to
> write a CollectorManager with a Collector that is not really concurrent
> friendly, we could end up in an undefined state.
>
> This Jira is more of a rhetorical discussion, and to explore if we should
> allow Collectors to implement an API which simply returns a boolean
> signifying if a Collector is parallel ready or not. The default would be
> true, until a Collector explicitly overrides it?
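[Editorial note] The reason a {{CollectorManager}} implies thread safety is its contract: one fresh collector per slice (so per-slice state never needs synchronization), then a single-threaded reduce step that merges the partial results, the way {{IndexSearcher}} merges per-slice {{TopDocs}} with {{TopDocs.merge}}. A minimal JDK-only model of that contract, with simplified stand-in names rather than Lucene's actual interfaces:

```java
import java.util.ArrayList;
import java.util.List;

// Model of the CollectorManager contract: newCollector() per slice,
// reduce() once at the end on a single thread.
public class CountingManager {
    static class SliceCollector {
        int count;                        // per-slice state, no locking needed
        void collect(int docId) { count++; }
    }

    private final List<SliceCollector> collectors = new ArrayList<>();

    synchronized SliceCollector newCollector() {  // called once per slice
        SliceCollector c = new SliceCollector();
        collectors.add(c);
        return c;
    }

    int reduce() {                        // called once, after all slices finish
        int total = 0;
        for (SliceCollector c : collectors) total += c.count;
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        CountingManager manager = new CountingManager();
        Thread[] slices = new Thread[3];
        for (int i = 0; i < slices.length; i++) {
            SliceCollector c = manager.newCollector();
            final int docs = (i + 1) * 10;
            slices[i] = new Thread(() -> { for (int d = 0; d < docs; d++) c.collect(d); });
            slices[i].start();
        }
        for (Thread t : slices) t.join();
        System.out.println(manager.reduce());  // 60
    }
}
```

A collector whose semantics break under this pattern (e.g. one that must see all hits in global order) is exactly the corner case the issue asks about: the manager shape alone cannot express that constraint.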
[jira] [Updated] (LUCENE-8884) Add Directory wrapper to track per-query IO counters
[ https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-8884:
---------------------------------------
    Attachment: LUCENE-8884.patch
        Status: Open  (was: Open)

Another iteration folding in [~rcmuir]'s feedback. I was worried that the thread that calls {{clone()}} may not be the same thread that then consumes the {{IndexInput}}, and added an assertion, but it looks like it's OK. I added another random test too, and improved the javadocs.

I could not eliminate the required {{setKeyForThread}} call because even in the single-threaded case, where only one thread executes the query across all segments, the directory wrapper still needs to know which query that is in order to track its IO counters.

I haven't tested the performance impact of this, but it's likely minor now since we retrieve the counters on {{clone()}} instead of on every IO operation.

> Add Directory wrapper to track per-query IO counters
> ----------------------------------------------------
>
> Key: LUCENE-8884
> URL: https://issues.apache.org/jira/browse/LUCENE-8884
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/store
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-8884.patch, LUCENE-8884.patch
>
> Lucene's IO abstractions ({{Directory}}, {{IndexInput/Output}}) make it really
> easy to track counters of how many IOPs and net bytes are read for each
> query, which is a useful metric to track/aggregate/alarm on in production or
> dev benchmarks.
> At my day job we use these wrappers in our nightly benchmarks to catch any
> accidental performance regressions.
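[Editorial note] The accounting pattern discussed in this patch, a thread registering which query it is working on, and the byte counter being resolved once at input-creation/clone time rather than on every read, can be sketched with JDK classes. All names below are illustrative assumptions, not the patch's actual API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Per-query IO counters: the searcher thread first registers its query
// (the setKeyForThread idea); each opened/cloned input captures that
// query's counter once, so per-read bookkeeping is a single add.
public class IOTracker {
    private final Map<String, AtomicLong> bytesByQuery = new ConcurrentHashMap<>();
    private final ThreadLocal<String> currentQuery = new ThreadLocal<>();

    public void setKeyForThread(String queryId) {
        currentQuery.set(queryId);
    }

    /** Models IndexInput creation/clone(): resolve the counter once, up front. */
    public TrackedInput openInput() {
        AtomicLong counter =
            bytesByQuery.computeIfAbsent(currentQuery.get(), k -> new AtomicLong());
        return new TrackedInput(counter);
    }

    public long bytesReadFor(String queryId) {
        AtomicLong c = bytesByQuery.get(queryId);
        return c == null ? 0 : c.get();
    }

    /** Per-clone view: every read just bumps the pre-resolved counter. */
    public static class TrackedInput {
        private final AtomicLong counter;
        TrackedInput(AtomicLong counter) { this.counter = counter; }
        public void readBytes(int len) { counter.addAndGet(len); }
    }

    public static void main(String[] args) {
        IOTracker tracker = new IOTracker();
        tracker.setKeyForThread("query-42");
        IOTracker.TrackedInput in = tracker.openInput();
        in.readBytes(1024);
        in.readBytes(512);
        System.out.println(tracker.bytesReadFor("query-42"));  // 1536
    }
}
```

This also shows why the thread-local lookup survives even in the single-threaded case: without {{setKeyForThread}}, the wrapper has no way to know which query's counter an input belongs to.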
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922559#comment-16922559 ]

Michael McCandless commented on LUCENE-8962:

Thanks [~dsmiley]. That sounds like a nice improvement to TMP, but I would want to match the more aggressive merging of those tiny segments w/ a refresh or commit, not run it in general, since I think for pure indexing that'd hurt indexing throughput.

I think the tricky part of this change is fixing {{IndexWriter}} refresh or commit to let the merge policy know it should now aggressively merge small segments, within a time or total size budget or something, while the refresh/commit operation waits, so that the returned segments have been merged, even while (concurrently) new segments are flushed.

Synchronous merges in the merge scheduler sound interesting for this use case – maybe open a separate issue for that?

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Priority: Major
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory
> segments to disk and open an {{IndexReader}} to search them, and this is
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}}
> will write many small segments during {{refresh}}, and this then adds
> search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if
> given a little time ... so, could we somehow improve {{IndexWriter}}'s
> refresh to optionally kick off the merge policy to merge segments below some
> threshold before opening the near-real-time reader? It'd be a bit tricky
> because while we are waiting for merges, indexing may continue, and new
> segments may be flushed, but those new segments shouldn't be included in the
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy,
> and some hackity logic to have the merge policy target small segments just
> written by refresh, but it's tricky to then open a near-real-time reader,
> excluding newly flushed but including newly merged segments since the refresh
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for
> discussion!
[jira] [Created] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
Michael McCandless created LUCENE-8962:
------------------------------------------

             Summary: Can we merge small segments during refresh, for faster searching?
                 Key: LUCENE-8962
                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/index
            Reporter: Michael McCandless

With near-real-time search we ask {{IndexWriter}} to write all in-memory segments to disk and open an {{IndexReader}} to search them, and this is typically a quick operation.

However, when you use many threads for concurrent indexing, {{IndexWriter}} will write many small segments during {{refresh}}, and this then adds search-time cost as searching must visit all of these tiny segments.

The merge policy would normally quickly coalesce these small segments if given a little time ... so, could we somehow improve {{IndexWriter}}'s refresh to optionally kick off the merge policy to merge segments below some threshold before opening the near-real-time reader? It'd be a bit tricky because while we are waiting for merges, indexing may continue, and new segments may be flushed, but those new segments shouldn't be included in the point-in-time segments returned by refresh ...

One could almost do this on top of Lucene today, with a custom merge policy, and some hackity logic to have the merge policy target small segments just written by refresh, but it's tricky to then open a near-real-time reader, excluding newly flushed but including newly merged segments since the refresh originally finished ...

I'm not yet sure how best to solve this, so I wanted to open an issue for discussion!
[jira] [Commented] (LUCENE-8403) Support 'filtered' term vectors - don't require all terms to be present
[ https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920099#comment-16920099 ]

Michael McCandless commented on LUCENE-8403:

Maybe first test the performance of the separate field? If the double analysis is really a problem, you could use {{CachingTokenFilter}} to analyze only once?

> Support 'filtered' term vectors - don't require all terms to be present
> -----------------------------------------------------------------------
>
> Key: LUCENE-8403
> URL: https://issues.apache.org/jira/browse/LUCENE-8403
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael Braun
> Priority: Minor
> Attachments: LUCENE-8403.patch
>
> The genesis of this was a conversation and idea from [~dsmiley] several years
> ago.
> In order to optimize term vector storage, we may not actually need all tokens
> to be present in the term vectors - and if so, ideally our codec could just
> opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and
> TermVectorsWriter to ignore storing certain Terms within a field. This
> worked; however, CheckIndex checks that the terms present in the standard
> postings are also present in the TVs, if TVs are enabled. So this then doesn't
> work as 'valid' according to CheckIndex.
> Can the TermVectorsFormat be made in such a way as to support configuration of
> tokens that should not be stored (benefits: less storage, more optimal
> retrieval per doc)? Is this valuable to the wider community? Is there a way
> we can design this to not break CheckIndex's contract while at the same time
> lessening storage for unneeded tokens?
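[Editorial note] The analyze-once idea suggested above, which is what {{CachingTokenFilter}} provides in Lucene, can be shown outside Lucene's APIs: tokenize a single time, cache the token list, and let both consumers (e.g. the postings field and the filtered term-vector field) replay the cached tokens instead of re-running analysis. The class below is a hypothetical stand-in, not Lucene code.

```java
import java.util.Arrays;
import java.util.List;

// Cache analysis output on first use; later consumers replay the cache.
public class CachedTokens {
    private List<String> cached;
    private final String text;

    public CachedTokens(String text) { this.text = text; }

    /** Tokenizes on the first call only; later calls replay the cache. */
    public List<String> tokens() {
        if (cached == null) {
            // Trivial stand-in for an analyzer chain: lowercase + whitespace split.
            cached = Arrays.asList(text.toLowerCase().split("\\s+"));
        }
        return cached;
    }

    public static void main(String[] args) {
        CachedTokens source = new CachedTokens("Filtered Term Vectors");
        List<String> forPostings = source.tokens();   // analysis happens here
        List<String> forVectors = source.tokens();    // replayed from cache
        System.out.println(forPostings == forVectors); // true: same cached list
    }
}
```

The trade-off is the one the comment hints at: caching buys back the CPU of a second analysis pass at the cost of holding one field's token stream in memory per document.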
[jira] [Commented] (LUCENE-8884) Add Directory wrapper to track per-query IO counters
[ https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909302#comment-16909302 ]

Michael McCandless commented on LUCENE-8884:

Thanks for the review [~rcmuir]. We need the thread locals because we pass an {{ExecutorService}} to {{IndexSearcher}} to keep our long-pole query latencies down. So we need some way to associate a searcher thread with the query it's handling, but maybe we can make that less invasive, e.g. default better for the more common single-threaded query case?

{quote}must you call get on every op vs once in the ctor? after all thats why we have clone? ( should not have thread issues )
{quote}
Ahh, that's a good point – once the {{IndexInput}} is created, only one thread will use it – I'll fix that! This should reduce overhead substantially, maybe enough to run in production by default.

{quote}readint has a second spurious call.
{quote}
Woops, I'll fix that too.
[jira] [Commented] (LUCENE-8884) Add Directory wrapper to track per-query IO counters
[ https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907361#comment-16907361 ]

Michael McCandless commented on LUCENE-8884:

I plan to push this soon ... it just adds this new directory wrapper to the {{misc}} module.
[jira] [Commented] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies
[ https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907330#comment-16907330 ] Michael McCandless commented on LUCENE-8947: Indeed we disable norms ... that’s a good idea to skip length accumulation when norms are disabled. I’ll give that a shot. > Indexing fails with "too many tokens for field" when using custom term > frequencies > -- > > Key: LUCENE-8947 > URL: https://issues.apache.org/jira/browse/LUCENE-8947 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 7.5 >Reporter: Michael McCandless >Priority: Major > > We are using custom term frequencies (LUCENE-7854) to index per-token scoring > signals, however for one document that had many tokens and those tokens had > fairly large (~998,000) scoring signals, we hit this exception: > {noformat} > 2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) > com.amazon.lucene.index.IndexGCRDocument: Failed to index doc: > java.lang.IllegalArgumentException: too many tokens for field "foobar" > at > org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825) > at > org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430) > at > org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394) > at > org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297) > at > org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450) > at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291) > at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264) > {noformat} > This is happening in this code in {{DefaultIndexingChain.java}}: > {noformat} > try { > invertState.length = Math.addExact(invertState.length, > invertState.termFreqAttribute.getTermFrequency()); > } catch (ArithmeticException ae) { > throw new 
IllegalArgumentException("too many tokens for field \"" + > field.name() + "\""); > }{noformat} > Where Lucene is accumulating the total length (number of tokens) for the > field. But total length doesn't really make sense if you are using custom > term frequencies to hold arbitrary scoring signals? Or, maybe it does make > sense, if user is using this as simple boosting, but maybe we should allow > this length to be a {{long}}? -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
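[Editorial note] The failure above comes from accumulating per-token term frequencies into an {{int}} field length with {{Math.addExact}}. With custom frequencies near ~998,000 per token, a couple of thousand tokens is enough to pass {{Integer.MAX_VALUE}}; a {{long}} accumulator, as the description suggests, would not overflow. A self-contained demonstration (field and method names are illustrative, not Lucene's):

```java
// Reproduce the int overflow that triggers "too many tokens for field",
// and show the long-based accumulation that avoids it.
public class FieldLengthOverflow {
    public static int accumulateInt(int[] termFreqs) {
        int length = 0;
        for (int freq : termFreqs) {
            try {
                length = Math.addExact(length, freq);  // throws on int overflow
            } catch (ArithmeticException ae) {
                throw new IllegalArgumentException("too many tokens for field \"foobar\"");
            }
        }
        return length;
    }

    public static long accumulateLong(int[] termFreqs) {
        long length = 0;
        for (int freq : termFreqs) {
            length += freq;  // 2200 * 998_000 fits easily in a long
        }
        return length;
    }

    public static void main(String[] args) {
        int[] freqs = new int[2200];
        java.util.Arrays.fill(freqs, 998_000);
        System.out.println(accumulateLong(freqs));  // 2195600000 > Integer.MAX_VALUE
        try {
            accumulateInt(freqs);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());     // too many tokens for field "foobar"
        }
    }
}
```

This also illustrates why skipping accumulation when norms are disabled (the fix proposed in the comment) sidesteps the problem entirely: the length is only consumed by norms.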
[jira] [Commented] (LUCENE-8369) Remove the spatial module as it is obsolete
[ https://issues.apache.org/jira/browse/LUCENE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905566#comment-16905566 ]

Michael McCandless commented on LUCENE-8369:

+1 for option 1 above.

> Remove the spatial module as it is obsolete
> -------------------------------------------
>
> Key: LUCENE-8369
> URL: https://issues.apache.org/jira/browse/LUCENE-8369
> Project: Lucene - Core
> Issue Type: Task
> Components: modules/spatial
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Major
> Attachments: LUCENE-8369.patch
>
> The "spatial" module is at this juncture nearly empty, with only a couple of
> utilities that aren't used by anything in the entire codebase --
> GeoRelationUtils and MortonEncoder. Perhaps it should have been removed
> earlier in LUCENE-7664, which was the removal of GeoPointField, which was
> essentially why the module existed. Better late than never.
[jira] [Created] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies
Michael McCandless created LUCENE-8947:
------------------------------------------

             Summary: Indexing fails with "too many tokens for field" when using custom term frequencies
                 Key: LUCENE-8947
                 URL: https://issues.apache.org/jira/browse/LUCENE-8947
             Project: Lucene - Core
          Issue Type: Improvement
    Affects Versions: 7.5
            Reporter: Michael McCandless

We are using custom term frequencies (LUCENE-7854) to index per-token scoring signals; however, for one document that had many tokens, and those tokens had fairly large (~998,000) scoring signals, we hit this exception:

{noformat}
2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) com.amazon.lucene.index.IndexGCRDocument: Failed to index doc:
java.lang.IllegalArgumentException: too many tokens for field "foobar"
  at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
  at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
  at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
  at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
  at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
  at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
  at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
{noformat}

This is happening in this code in {{DefaultIndexingChain.java}}:

{noformat}
try {
  invertState.length = Math.addExact(invertState.length,
      invertState.termFreqAttribute.getTermFrequency());
} catch (ArithmeticException ae) {
  throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
}
{noformat}

Here Lucene is accumulating the total length (number of tokens) for the field. But total length doesn't really make sense if you are using custom term frequencies to hold arbitrary scoring signals? Or, maybe it does make sense, if the user is using this as simple boosting, but maybe we should allow this length to be a {{long}}?
[jira] [Commented] (LUCENE-8369) Remove the spatial module as it is obsolete
[ https://issues.apache.org/jira/browse/LUCENE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893187#comment-16893187 ]

Michael McCandless commented on LUCENE-8369:

{quote}Lots of awesome functionality _commonly_ needed in search is in our modules – like highlighting, autocomplete, and spellcheck, to name a few. Why should spatial be an exception?
{quote}
Well, other examples are default analysis ({{StandardAnalyzer}}), common queries (versus exotic queries in the queries module), and most {{Directory}} implementations, where we have some common choices in core and more exotic choices in our modules. I think the distinction (the "common" classes versus the "exotic" ones) is helpful for our users in areas that have many, many options.

[~nknize] can you give a concrete example where the code sharing is making things difficult? Can we simply make the necessary APIs public and marked {{@lucene.internal}} in our core spatial classes?
[jira] [Commented] (LUCENE-8865) Use incoming thread for execution if IndexSearcher has an executor
[ https://issues.apache.org/jira/browse/LUCENE-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890429#comment-16890429 ]

Michael McCandless commented on LUCENE-8865:

Alas, I ran our internal benchmarks (production queries on production documents, measuring red-line QPS and long-pole query latencies at 10% capacity) and I could not measure any change due to this fix – it seems to be in the noise. I was hoping for a small gain due to one fewer thread context switch ... but I still think the change is a good one! Thanks [~simonw]!

> Use incoming thread for execution if IndexSearcher has an executor
> ------------------------------------------------------------------
>
> Key: LUCENE-8865
> URL: https://issues.apache.org/jira/browse/LUCENE-8865
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Simon Willnauer
> Priority: Major
> Fix For: master (9.0), 8.2
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> Today we don't utilize the incoming thread for a search when IndexSearcher
> has an executor. This thread just idles, but it can be used to execute a
> search once all other collectors are dispatched.
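[Editorial note] The change discussed above can be sketched in plain JDK terms: instead of submitting every slice to the executor and leaving the caller idle, submit all slices but one and run the last slice on the incoming thread, saving one handoff/context switch. The names and the trivial "slice work" below are illustrative, not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Dispatch all slices but the last to the executor; the caller thread
// works on the remaining slice, then gathers the other results.
public class CallerRunsLastSlice {
    public static int search(int[] sliceSizes, ExecutorService executor)
            throws InterruptedException, ExecutionException {
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < sliceSizes.length - 1; i++) {
            final int size = sliceSizes[i];
            futures.add(executor.submit(() -> collect(size)));
        }
        // The incoming thread works too, once the other slices are dispatched.
        int total = collect(sliceSizes[sliceSizes.length - 1]);
        for (Future<Integer> f : futures) total += f.get();
        return total;
    }

    // Stand-in for collecting one slice; returns its "hit count".
    private static int collect(int size) { return size; }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            System.out.println(search(new int[] {10, 20, 30}, executor));  // 60
        } finally {
            executor.shutdown();
        }
    }
}
```

As the benchmark result above suggests, the saved handoff is small relative to total query cost, so the win shows up mostly in avoiding a wasted thread, not in latency.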
[jira] [Updated] (LUCENE-8884) Add Directory wrapper to track per-query IO counters
[ https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-8884:
---------------------------------------
    Attachment: LUCENE-8884.patch
        Status: Open  (was: Open)

Trying again to attach the first-cut patch!
[jira] [Commented] (LUCENE-8884) Add Directory wrapper to track per-query IO counters
[ https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887154#comment-16887154 ]

Michael McCandless commented on LUCENE-8884:

Argh!! Not sure how I messed that up ... I'll fix once I have access to a laptop again. Thanks for checking [~jpountz]!
[jira] [Updated] (LUCENE-8884) Add Directory wrapper to track per-query IO counters
[ https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-8884:
---------------------------------------
        Status: Open  (was: Open)

Here's an initial patch, adding {{IOTrackingDirectoryWrapper}}. Whenever a given thread is "working" on a particular query, it must first call {{setQueryForThread}} so the wrapper knows which query's counters to increment. It tracks the number of IOPs and how many total bytes were read.

It likely impacts search performance, so it should only be used during profiling/benchmarking.
[jira] [Assigned] (LUCENE-8884) Add Directory wrapper to track per-query IO counters
[ https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-8884: -- Assignee: Michael McCandless
[jira] [Commented] (LUCENE-8069) Allow index sorting by field length
[ https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881211#comment-16881211 ] Michael McCandless commented on LUCENE-8069: +1, results are impressive. > Allow index sorting by field length > --- > > Key: LUCENE-8069 > URL: https://issues.apache.org/jira/browse/LUCENE-8069 > Project: Lucene - Core > Issue Type: Wish >Reporter: Adrien Grand >Priority: Minor > > Short documents are more likely to get higher scores, so sorting an index by > field length would mean we would be likely to collect the best matches first. > Depending on the similarity implementation, this might even allow early > termination of top-document collection for term queries.
[jira] [Commented] (LUCENE-8311) Leverage impacts for phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881203#comment-16881203 ] Michael McCandless commented on LUCENE-8311: +1 to merge ... that is a good tradeoff! Astronomical speedups for {{PhraseQuery}} and some small slowdowns in others. It's important that all of our common queries properly handle impacts. > Leverage impacts for phrase queries > --- > > Key: LUCENE-8311 > URL: https://issues.apache.org/jira/browse/LUCENE-8311 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: LUCENE-8311.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Now that we expose raw impacts, we could leverage them for phrase queries. > For instance for exact phrases, we could take the minimum term frequency for > each unique norm value in order to get upper bounds of the score for the > phrase. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
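The bound described in the issue above can be illustrated with a small standalone sketch. This is not Lucene's {{ImpactsEnum}} API; it simplifies impacts to a per-term map from norm value to maximum term frequency, and only keeps norms recorded for every term (a real implementation works on per-block upper bounds): for an exact phrase, a document's phrase frequency cannot exceed the smallest per-term frequency, so the minimum over terms upper-bounds the phrase frequency for that norm.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of combining per-term impacts into phrase-level bounds.
public class PhraseImpacts {
  // termImpacts: for each phrase term, a map norm -> max term frequency at that norm.
  // Returns norm -> an upper bound on the exact-phrase frequency at that norm.
  public static Map<Long, Integer> phraseBounds(List<Map<Long, Integer>> termImpacts) {
    Map<Long, Integer> bounds = new TreeMap<>();
    for (Long norm : termImpacts.get(0).keySet()) {
      int minFreq = Integer.MAX_VALUE;
      boolean allHave = true;
      for (Map<Long, Integer> impacts : termImpacts) {
        Integer maxFreq = impacts.get(norm);
        if (maxFreq == null) { allHave = false; break; }
        // The phrase cannot occur more often than its rarest member term.
        minFreq = Math.min(minFreq, maxFreq);
      }
      if (allHave) bounds.put(norm, minFreq);
    }
    return bounds;
  }

  public static void main(String[] args) {
    // Hypothetical impacts for the two terms of the phrase "quick fox".
    Map<Long, Integer> quick = Map.of(8L, 5, 12L, 9);
    Map<Long, Integer> fox = Map.of(8L, 3, 16L, 4);
    System.out.println(phraseBounds(List.of(quick, fox)));
  }
}
```

Feeding such (norm, maxFreq) pairs to the similarity yields a score upper bound, which is what lets the phrase scorer skip blocks that cannot beat the current top hits.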
[jira] [Commented] (LUCENE-4312) Index format to store position length per position
[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881195#comment-16881195 ] Michael McCandless commented on LUCENE-4312: +1 to build on payloads, today, to break the chicken/egg situation. This should just be a {{TokenFilter}} that converts {{PositionLengthAttribute}} into payloads? Then [~mgibney] could contribute query-time code that can decode these payloads and implement correct positional queries. Once these prove useful we can circle back later and optimize how we store position lengths in the index. > Index format to store position length per position > -- > > Key: LUCENE-4312 > URL: https://issues.apache.org/jira/browse/LUCENE-4312 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 6.0 >Reporter: Gang Luo >Priority: Minor > Labels: Suggestion > Original Estimate: 72h > Remaining Estimate: 72h > > Mike McCandless said: TokenStreams are actually graphs. > The indexer ignores PositionLengthAttribute. We need to change the index format (and > Codec APIs) to store an additional int position length per position.
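The encoding such a {{TokenFilter}} could use is straightforward. The sketch below is standalone (no Lucene types); it only shows one plausible payload encoding, a variable-length int like Lucene commonly uses for small counters, for the {{PositionLengthAttribute}} value, plus the matching query-time decode.

```java
import java.io.ByteArrayOutputStream;

// Standalone sketch: encode a token's position length as a vInt payload at
// index time, and decode it at query time. The real filter would copy this
// into PayloadAttribute inside incrementToken().
public class PosLenPayload {
  // Variable-length encoding: 7 bits per byte, high bit = continuation.
  public static byte[] encode(int posLen) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((posLen & ~0x7F) != 0) {
      out.write((posLen & 0x7F) | 0x80);
      posLen >>>= 7;
    }
    out.write(posLen);
    return out.toByteArray();
  }

  public static int decode(byte[] payload) {
    int value = 0, shift = 0, i = 0;
    byte b;
    do {
      b = payload[i++];
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  public static void main(String[] args) {
    // A multi-word synonym token like "ny" over "new york" has position length 2.
    byte[] payload = encode(2);
    System.out.println("posLen=" + decode(payload) + " payloadBytes=" + payload.length);
  }
}
```

The common case (position length 1 or a small synonym span) fits in a single payload byte, which keeps the index overhead modest until a native encoding lands.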
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881120#comment-16881120 ] Michael McCandless commented on LUCENE-8753: +1 to push this new codec in Lucene, e.g. under codecs or sandbox or misc modules, if we can avoid making changes to other sources (once LUCENE-8906 is fixed). > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 3h 20m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. 
> - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8906) Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState
[ https://issues.apache.org/jira/browse/LUCENE-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881109#comment-16881109 ] Michael McCandless commented on LUCENE-8906: +1 to simply make {{IntBlockTermState}} public. > Lucene50PostingsReader.postings() casts BlockTermState param to private > IntBlockTermState > - > > Key: LUCENE-8906 > URL: https://issues.apache.org/jira/browse/LUCENE-8906 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Bruno Roustant >Priority: Major > > Lucene50PostingsReader is the public API that offers the postings() method to > read the postings. Any PostingsFormat can use it (as well as > Lucene50PostingsWriter) to read/write postings. > But the postings() method asks for a (public) BlockTermState param which is > internally cast to the private IntBlockTermState. This BlockTermState is > provided by Lucene50PostingsReader.newTermState(). > public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, > PostingsEnum reuse, int flags) > This actually makes it impossible for a custom PostingsFormat that customizes > the block file structure to use this postings() method by providing its own > (Int)BlockTermState, because it cannot access the FP fields of the > IntBlockTermState returned by PostingsReaderBase.newTermState(). > Proposed change: > * Either make IntBlockTermState public, as well as its fields. > * Or replace it with an interface in the postings() method. In this case the > IntBlockTermState fields currently accessed directly would be replaced by > getters/setters.
[jira] [Commented] (LUCENE-8878) Provide alternative sorting utility from SortField other than FieldComparator
[ https://issues.apache.org/jira/browse/LUCENE-8878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879274#comment-16879274 ] Michael McCandless commented on LUCENE-8878: {quote}I believe you are talking about Scorer#setMinCompetitiveScore, ie. changing the FieldComparator API to only track the bottom bucket as opposed to every bucket? If this is the case I agree that it sounds like a good idea to explore. {quote} Ahh, yes, that ;) +1 > Provide alternative sorting utility from SortField other than FieldComparator > - > > Key: LUCENE-8878 > URL: https://issues.apache.org/jira/browse/LUCENE-8878 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: 8.1.1 >Reporter: Tony Xu >Priority: Major > > The `FieldComparator` has many responsibilities and users get all of them at > once. At a high level, the main functionalities of `FieldComparator` are > * Provide LeafFieldComparator > * Allocate storage for requested number of hits > * Read the values from DocValues/Custom source etc. > * Compare two values > There are three major areas for improvement > # The logic for reading values and for storing them is coupled. > # Users need to specify the size in order to create a `FieldComparator` but > sometimes the size is unknown upfront. > # From `FieldComparator`'s API, one can't reason about thread-safety so it > is not suitable for concurrent search. > E.g. Can two concurrent threads use the same `FieldComparator` to call > `getLeafComparator` for two different segments they are working on? In fact, > almost all existing implementations of `FieldComparator` are not thread-safe. 
> The proposal is to enhance `SortField` with two APIs > # {color:#14892c}int compare(Object v1, Object v2){color} – this is to > compare two values from different docs for this field > # {color:#14892c}ValueAccessor newValueAccessor(LeafReaderContext > leaf){color} – This encapsulates the logic for obtaining the right > implementation in order to read the field values. > `ValueAccessor` should be accessed in a similar way as `DocValues` to > provide the sort value for a document in an advance & read fashion. > With this API, hopefully we can reduce the memory usage when using > `FieldComparator` because users currently store either the sort values or at least > the slot number in addition to the storage allocated by `FieldComparator` itself. > Ideally, only one copy of the values should be stored. > The proposed API is also more friendly to concurrent search since it provides > the `ValueAccessor` per leaf. Although the same `ValueAccessor` can't be shared > if more than one thread is working on the same leaf, at least each thread can > initialize its own `ValueAccessor`.
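The proposed split can be sketched in a few lines. The types below are illustrative stand-ins (a long[] replaces LeafReaderContext, and `ValueAccessor` is simplified to random access rather than the advance-and-read style the proposal describes); the point is only the shape of the two APIs: a per-leaf accessor that reads values, and a SortField that does nothing but compare them.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the two-API proposal: SortField compares, ValueAccessor reads.
public class SortFieldSketch {
  interface ValueAccessor {
    Object valueFor(int docId); // advance-and-read in the real proposal
  }

  interface SimpleSortField {
    int compare(Object v1, Object v2);
    // Stand-in for newValueAccessor(LeafReaderContext): each thread opens its
    // own accessor per leaf, so nothing shared is mutated during collection.
    ValueAccessor newValueAccessor(long[] leafValues);
  }

  static final SimpleSortField LONG_FIELD = new SimpleSortField() {
    public int compare(Object v1, Object v2) {
      return Long.compare((Long) v1, (Long) v2);
    }
    public ValueAccessor newValueAccessor(long[] leafValues) {
      return docId -> leafValues[docId];
    }
  };

  // Sort a leaf's doc ids by field value using only the two proposed APIs --
  // no slots, no hidden priority queue, no pre-sized storage.
  public static List<Integer> sortedDocs(long[] leafValues) {
    ValueAccessor acc = LONG_FIELD.newValueAccessor(leafValues);
    List<Integer> docs = new ArrayList<>();
    for (int d = 0; d < leafValues.length; d++) docs.add(d);
    docs.sort((a, b) -> LONG_FIELD.compare(acc.valueFor(a), acc.valueFor(b)));
    return docs;
  }

  public static void main(String[] args) {
    System.out.println(sortedDocs(new long[] {30, 10, 20}));
  }
}
```

Note how the caller chooses the storage (here, just the doc id list), which is exactly the memory-usage point made above.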
[jira] [Commented] (LUCENE-8781) Explore FST direct array arc encoding
[ https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875903#comment-16875903 ] Michael McCandless commented on LUCENE-8781: +1 to do the simple approach even if it costs a little performance, and to delete the unused method in {{oal.util.fst.Util}}. This is an experimental codec that implements an optional terms dict API that assigns a {{long}} ordinal to each term. > Explore FST direct array arc encoding > -- > > Key: LUCENE-8781 > URL: https://issues.apache.org/jira/browse/LUCENE-8781 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Mike Sokolov >Assignee: Dawid Weiss >Priority: Major > Fix For: master (9.0), 8.2 > > Attachments: FST-2-4.png, FST-6-9.png, FST-size.png > > Time Spent: 4h 10m > Remaining Estimate: 0h > > This issue is for exploring an alternate FST encoding of Arcs as full-sized > arrays so Arcs are addressed directly by label, avoiding binary search that > we use today for arrays of Arcs. PR: > https://github.com/apache/lucene-solr/pull/657 > h3. Testing > ant test passes. I added some unit tests that were helpful in uncovering bugs > while > implementing which are more difficult to chase down when uncovered by the > randomized testing we already do. They don't really test anything new; > they're just more focused. > I'm not sure why, but ant precommit failed for me with: > {noformat} > ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls > failed while scanning class > 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' > (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: > info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about > referenced class 'info.ganglia.gmetric4j.gmetric.GMetric') > {noformat} > I also got Test2BFST running (it was originally timing out due to excessive > calls to ramBytesUsage(), which seems to have gotten slow), and it passed; > that change isn't include here. 
> h4. Micro-benchmark > I timed lookups in FST via FSTEnum.seekExact in a unit test under various > conditions. > h5. English words > A test of looking up existing words in a dictionary of ~17 English words > shows improvements; the numbers listed are % change in FST size, time to look > up (FSTEnum.seekExact) words that are in the dict, and time to look up random > strings that are not in the dict. The comparison is against the current > codebase with the optimization disabled. A separate comparison showed no > significant change between the baseline (no opto applied) and the current master > FST impl with no code changes applied. > || load=2 || load=4 || load=16 || > | +4, -6, -7 | +18, -11, -8 | +22, -11.5, -7 | > The "load factor" used for those measurements controls when direct array arc > encoding is used; > namely when the number of outgoing arcs was > load * (max label - min label). > h5. sequential and random terms > The same test, with terms being a sequence of integers as strings, shows a > larger improvement, around 20% (load=4). This is presumably the best case for > this delta, where every Arc is encoded as a direct lookup. > When random lowercase ASCII strings are used, a smaller improvement of around > 4% is seen. > h4. luceneutil > Testing w/luceneutil (wikimediumall) we see improvements mostly in the > PKLookup case. Other results seem noisy, with perhaps a small improvement in > some of the queries.
> {noformat}
> Task                    QPS base   StdDev   QPS opto   StdDev      Pct diff
> OrHighHigh                  6.93   (3.0%)       6.89   (3.1%)   -0.5% (  -6% -    5%)
> OrHighMed                  45.15   (3.9%)      44.92   (3.5%)   -0.5% (  -7% -    7%)
> Wildcard                    8.72   (4.7%)       8.69   (4.6%)   -0.4% (  -9% -    9%)
> AndHighLow                274.11   (2.6%)     273.58   (3.1%)   -0.2% (  -5% -    5%)
> OrHighLow                 241.41   (1.9%)     241.11   (3.5%)   -0.1% (  -5% -    5%)
> AndHighMed                 52.23   (4.1%)      52.41   (5.3%)    0.3% (  -8% -   10%)
> MedTerm                  1026.24   (3.1%)    1030.52   (4.3%)    0.4% (  -6% -    8%)
> HighTerm                     .10   (3.4%)    1116.70   (4.0%)    0.5% (  -6% -    8%)
> HighTermDayOfYearSort      14.59   (8.2%)      14.73   (9.3%)    1.0% ( -15% -   20%)
> AndHighHigh                13.45   (6.2%)      13.61   (4.4%)    1.2% (  -8% -   12%)
> HighTermMonthSort          63.09  (12.5%)      64.13  (10.9%)    1.6% ( -19% -   28%)
> LowTerm                  1338.94   (3.3%)    1383.90   (5
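The tradeoff being benchmarked above can be shown with a toy example (this is not the FST implementation itself; names and structure are illustrative): today an arc is found by binary search over the sorted labels stored at a node, while the optimization stores a full-sized array indexed directly by label, spending memory on empty slots to make each lookup O(1).

```java
import java.util.Arrays;

// Toy comparison of the two arc encodings: binary search over present labels
// versus a direct-addressed array covering the whole [minLabel, maxLabel] range.
public class ArcLookup {
  // Binary-search encoding: only the labels that actually occur are stored.
  public static int binarySearchArc(int[] sortedLabels, int label) {
    int idx = Arrays.binarySearch(sortedLabels, label);
    return idx >= 0 ? idx : -1; // arc index, or -1 if no arc for this label
  }

  // Direct-addressed encoding: one slot per possible label, -1 where absent.
  public static int[] buildDirectArray(int[] sortedLabels, int minLabel, int maxLabel) {
    int[] slots = new int[maxLabel - minLabel + 1];
    Arrays.fill(slots, -1);
    for (int i = 0; i < sortedLabels.length; i++) {
      slots[sortedLabels[i] - minLabel] = i;
    }
    return slots;
  }

  public static int directArc(int[] slots, int minLabel, int label) {
    int off = label - minLabel;
    return (off < 0 || off >= slots.length) ? -1 : slots[off];
  }

  public static void main(String[] args) {
    int[] labels = {'a', 'c', 'd', 'z'};
    int[] slots = buildDirectArray(labels, 'a', 'z');
    System.out.println("binarySearch=" + binarySearchArc(labels, 'd')
        + " direct=" + directArc(slots, 'a', 'd'));
  }
}
```

The load factor mentioned above is exactly the knob for this memory/speed tradeoff: the denser the label range, the fewer wasted slots the direct array costs.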
[jira] [Commented] (LUCENE-8878) Provide alternative sorting utility from SortField other than FieldComparator
[ https://issues.apache.org/jira/browse/LUCENE-8878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874573#comment-16874573 ] Michael McCandless commented on LUCENE-8878: The recently added impacts have a similar use case, where we need to express to the {{ImpactsEnum}} what the "bottom" of our PQ is, I think? Maybe we could take inspiration from that to simplify the comparator APIs or make them similar to how {{ImpactsEnum}} does it?
[jira] [Created] (LUCENE-8884) Add Directory wrapper to track per-query IO counters
Michael McCandless created LUCENE-8884: -- Summary: Add Directory wrapper to track per-query IO counters Key: LUCENE-8884 URL: https://issues.apache.org/jira/browse/LUCENE-8884 Project: Lucene - Core Issue Type: Improvement Components: core/store Reporter: Michael McCandless
[jira] [Commented] (LUCENE-8865) Use incoming thread for execution if IndexSearcher has an executor
[ https://issues.apache.org/jira/browse/LUCENE-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873293#comment-16873293 ] Michael McCandless commented on LUCENE-8865: I plan to test this using our production benchmarks ... will try to do that soon. > Use incoming thread for execution if IndexSearcher has an executor > --- > > Key: LUCENE-8865 > URL: https://issues.apache.org/jira/browse/LUCENE-8865 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Simon Willnauer >Priority: Major > Fix For: master (9.0), 8.2 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > Today we don't utilize the incoming thread for a search when IndexSearcher > has an executor. This thread is only idling but can be used to execute a > search once all other collectors are dispatched.
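The pattern this issue adopts can be sketched as follows. This is a simplified stand-in, not IndexSearcher's actual code: slices are modeled as int arrays and "collection" as a trivial count, but the scheduling is the point — dispatch all slices but one to the executor, run the last slice on the calling thread, then gather the futures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: the incoming thread does real work instead of idling on get().
public class CallerRunsSearch {
  // Stand-in for per-slice collection; "query" matches even doc ids.
  public static int searchSlice(int[] slice) {
    int hits = 0;
    for (int doc : slice) if (doc % 2 == 0) hits++;
    return hits;
  }

  public static int search(List<int[]> slices, ExecutorService executor) throws Exception {
    List<Future<Integer>> futures = new ArrayList<>();
    // Submit every slice except the last one.
    for (int i = 0; i < slices.size() - 1; i++) {
      final int[] slice = slices.get(i);
      futures.add(executor.submit(() -> searchSlice(slice)));
    }
    // The calling thread executes the remaining slice itself.
    int total = searchSlice(slices.get(slices.size() - 1));
    // Then it blocks only for whatever the pool has not finished yet.
    for (Future<Integer> f : futures) total += f.get();
    return total;
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      List<int[]> slices = List.of(new int[] {1, 2, 3}, new int[] {4, 5}, new int[] {6});
      System.out.println("hits=" + search(slices, pool));
    } finally {
      pool.shutdown();
    }
  }
}
```

Besides using one extra worker "for free", this also means a single-slice search never touches the pool at all.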
[jira] [Commented] (LUCENE-8878) Provide alternative sorting utility from SortField other than FieldComparator
[ https://issues.apache.org/jira/browse/LUCENE-8878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872270#comment-16872270 ] Michael McCandless commented on LUCENE-8878: +1 to simplify Lucene's comparator APIs – they are crazy complicated because they are "hiding" a priority queue underneath. They look nothing like you'd expect a comparator to look like! They were designed this way to sometimes enable int ordinal comparisons when sorting by string fields ({{DocValuesType.SORTED}}) but I'm not sure all that API complexity is really worth the performance. To access the values can we somehow use the existing {{FunctionValues}} classes?
[jira] [Commented] (LUCENE-8867) Optimise BKD tree for low cardinality leaves
[ https://issues.apache.org/jira/browse/LUCENE-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867256#comment-16867256 ] Michael McCandless commented on LUCENE-8867: +1 to both of these optimizations – I suspect many use cases will have such duplicate values and we could see a big reduction in index usage for the leaf blocks, and a speedup if we do the comparison once per unique value instead of once per value. > Optimise BKD tree for low cardinality leaves > > > Key: LUCENE-8867 > URL: https://issues.apache.org/jira/browse/LUCENE-8867 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > Currently, if a leaf of the BKD tree contains only a few distinct values, the leaf > is treated the same way as if all values were different. In many cases it can > be much more efficient to store the distinct values along with their cardinality. > In addition, in this case the method IntersectVisitor#visit(docId, byte[]) is > called n times with the same byte array but different docIDs. This issue > proposes to add a new method to the interface that accepts an array of docs > so it can be overridden by implementors to gain search performance.
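Both ideas from the issue above can be sketched together. This is a simplified illustration, not the BKD file format or Lucene's actual IntersectVisitor: a leaf stores each distinct value once alongside the docs that share it, and the visitor interface gains a bulk method whose default implementation delegates to the per-doc one, so existing implementors keep working.

```java
import java.util.Arrays;

// Sketch: run-length leaf storage plus a batch visit method with a default.
public class LowCardinalityLeaf {
  public interface Visitor {
    void visit(int docId, byte[] value);

    // Proposed addition: implementors may override to handle a batch at once;
    // the default preserves the old one-call-per-doc behavior.
    default void visit(int[] docIds, int count, byte[] value) {
      for (int i = 0; i < count; i++) visit(docIds[i], value);
    }
  }

  // One distinct value and the doc ids that carry it.
  public record Run(byte[] value, int[] docIds) {}

  public static void intersectLeaf(Run[] runs, byte[] min, byte[] max, Visitor visitor) {
    for (Run run : runs) {
      // One range comparison per distinct value instead of one per doc.
      if (Arrays.compareUnsigned(run.value(), min) >= 0
          && Arrays.compareUnsigned(run.value(), max) <= 0) {
        visitor.visit(run.docIds(), run.docIds().length, run.value());
      }
    }
  }

  public static void main(String[] args) {
    Run[] runs = {
      new Run(new byte[] {10}, new int[] {1, 4, 7}),
      new Run(new byte[] {50}, new int[] {2, 3}),
    };
    int[] matched = {0};
    intersectLeaf(runs, new byte[] {0}, new byte[] {20}, (doc, value) -> matched[0]++);
    System.out.println(matched[0] + " docs matched");
  }
}
```

With 3 docs sharing value 10 and 2 sharing value 50, the range [0, 20] costs two value comparisons rather than five, which is the claimed win for low-cardinality leaves.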
[jira] [Commented] (LUCENE-8854) Can we do "doc at a time scoring" from the BKD tree for exact queries?
[ https://issues.apache.org/jira/browse/LUCENE-8854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862929#comment-16862929 ] Michael McCandless commented on LUCENE-8854: {quote}moving points from a visitor API to a more cursor-style API that would allow us to walk freely the index of the KD tree. {quote} +1, that would enable exactly this kind of optimization. Maybe, it's an optional way to consume/walk the BKD tree that applies in only certain situations. > Can we do "doc at a time scoring" from the BKD tree for exact queries? > -- > > Key: LUCENE-8854 > URL: https://issues.apache.org/jira/browse/LUCENE-8854 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > > Random idea: normally our point queries must walk the BKD tree, building up a > sparse or dense bitset as a 1st pass, then in 2nd pass run the "normal" query > scorers (postings, doc values), because the docids coming out across leaf > blocks are not in docid order, like postings and doc values. > But, if the query is an exact point query, I think we tie break our within > leaf block sorts by docid, and that'd even apply across multiple leaf blocks > (if that value occurs enough times) and so for that case we could avoid the 2 > passes and do it all in one pass maybe? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8854) Can we do "doc at a time scoring" from the BKD tree for exact queries?
Michael McCandless created LUCENE-8854: -- Summary: Can we do "doc at a time scoring" from the BKD tree for exact queries? Key: LUCENE-8854 URL: https://issues.apache.org/jira/browse/LUCENE-8854 Project: Lucene - Core Issue Type: Improvement Reporter: Michael McCandless
[jira] [Commented] (LUCENE-8791) Add CollectorRescorer
[ https://issues.apache.org/jira/browse/LUCENE-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16860335#comment-16860335 ] Michael McCandless commented on LUCENE-8791: {quote}the default interface takes an ExecutorManager. {quote} Hmm did you mean {{ExecutorService}} or {{CollectorManager}} here? There is no {{ExecutorManger}} that I can see. +1 to mark the {{ExecutorService}} ctor/setter as expert w/ javadocs that explain that it is not often needed to distribute collection work across concurrent threads. > Add CollectorRescorer > - > > Key: LUCENE-8791 > URL: https://issues.apache.org/jira/browse/LUCENE-8791 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Elbek Kamoliddinov >Priority: Major > Attachments: LUCENE-8791.patch, LUCENE-8791.patch, LUCENE-8791.patch, > LUCENE-8791.patch, LUCENE-8791.patch > > > This is another implementation of query rescorer api (LUCENE-5489). It adds > rescoring functionality based on provided CollectorManager. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8823) IllegalStateException: wrong number of values added during doc values merge
Michael McCandless created LUCENE-8823: -- Summary: IllegalStateException: wrong number of values added during doc values merge Key: LUCENE-8823 URL: https://issues.apache.org/jira/browse/LUCENE-8823 Project: Lucene - Core Issue Type: Bug Affects Versions: 7.6 Reporter: Michael McCandless Here's another mysterious exception we hit in production, on Lucene 7.x snapshot release (near 7.6), OpenJDK 11: {noformat} 2019-05-31T05:49:22,443 [ERROR] (Lucene Merge Thread #0) com.amazon.lucene.util.UncaughtExceptionHandler: Uncaught exception: org.apache.lucene.index.MergePolicy$MergeException: java.lang.IllegalStateException: Wrong number of values added, expected: 97006, got: 95784 in thread Thread[Lucene Merge Thread #0,5,main] org.apache.lucene.index.MergePolicy$MergeException: java.lang.IllegalStateException: Wrong number of values added, expected: 97006, got: 95784 at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:704) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684) Caused by: java.lang.IllegalStateException: Wrong number of values added, expected: 97006, got: 95784 at org.apache.lucene.util.packed.DirectWriter.finish(DirectWriter.java:94) at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.writeValuesSingleBlock(Lucene70DocValuesConsumer.java:283) at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.writeValues(Lucene70DocValuesConsumer.java:263) at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.addNumericField(Lucene70DocValuesConsumer.java:110) at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:175) at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:135) at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:151) at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:182) at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:126) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4438) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4060) at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) at com.amazon.lucene.index.ConcurrentMergeSchedulerWrapper.doMerge(ConcurrentMergeSchedulerWrapper.java:54) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662){noformat} Merging of a numeric doc values field failed because too few values were added. This may also be a JVM bug, though our doc values codec code is quite complex so it could also be a Lucene bug! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
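For context on the exception message: {{DirectWriter}} is told up front how many values it will receive, and {{finish()}} verifies that count, so this failure means two passes over the same doc values data disagreed (97006 counted earlier vs. 95784 actually written). A minimal sketch of that contract in plain Java — not the actual Lucene class, just the invariant it enforces:

```java
// Sketch of a fixed-count writer contract, modeled on the check in
// DirectWriter.finish (illustrative only, not the Lucene implementation).
final class FixedCountWriter {
  private final long expected;
  private long added;

  FixedCountWriter(long expected) {
    this.expected = expected;
  }

  void add(long value) {
    added++; // a real writer would also encode the value here
  }

  void finish() {
    if (added != expected) {
      throw new IllegalStateException(
          "Wrong number of values added, expected: " + expected + ", got: " + added);
    }
  }
}
```

Under this contract, the merge exception implies the producer iterator yielded fewer values on the second pass than on the counting pass, which is why the report suspects either the codec's iteration logic or the JVM.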
[jira] [Created] (LUCENE-8822) UnsupportedOperationException: unused: not a comparison-based sort during IndexWriter flush
Michael McCandless created LUCENE-8822: -- Summary: UnsupportedOperationException: unused: not a comparison-based sort during IndexWriter flush Key: LUCENE-8822 URL: https://issues.apache.org/jira/browse/LUCENE-8822 Project: Lucene - Core Issue Type: Bug Affects Versions: 7.6 Reporter: Michael McCandless We hit this very strange exception in a production 7.x snapshot (near 7.6), OpenJDK 11: {noformat} Caused by: java.lang.UnsupportedOperationException: unused: not a comparison-based sort at org.apache.lucene.util.MSBRadixSorter.compare(MSBRadixSorter.java:115) at org.apache.lucene.util.Sorter.siftDown(Sorter.java:235) at org.apache.lucene.util.Sorter.heapify(Sorter.java:228) at org.apache.lucene.util.MSBRadixSorter.computeCommonPrefixLengthAndBuildHistogram(MSBRadixSorter.java:209) at org.apache.lucene.util.MSBRadixSorter.radixSort(MSBRadixSorter.java:148) at org.apache.lucene.util.MSBRadixSorter.radixSort(MSBRadixSorter.java:155) at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:128) at org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:121) at org.apache.lucene.util.bkd.MutablePointsReaderUtils.sort(MutablePointsReaderUtils.java:90) at org.apache.lucene.util.bkd.BKDWriter.writeField1Dim(BKDWriter.java:497) at org.apache.lucene.util.bkd.BKDWriter.writeField(BKDWriter.java:427) at org.apache.lucene.codecs.lucene60.Lucene60PointsWriter.writeField(Lucene60PointsWriter.java:105) at org.apache.lucene.index.PointValuesWriter.flush(PointValuesWriter.java:183) at org.apache.lucene.index.DefaultIndexingChain.writePoints(DefaultIndexingChain.java:206) at org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:141) at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:470) at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:554) at org.apache.lucene.index.DocumentsWriter.flushOneDWPT(DocumentsWriter.java:257) at 
org.apache.lucene.index.IndexWriter.flushNextBuffer(IndexWriter.java:3157) at com.amazon.lucene.index.Indexer.lambda$commit$0(Indexer.java:1129){noformat} The exception makes no sense to me: when I look at {{MSBRadixSorter.computeCommonPrefixLengthAndBuildHistogram}} at that line, it does NOT invoke {{Sorter.heapify}}, so I'm mystified. Maybe this is a JVM bug ... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
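One detail that makes this stack trace so confusing: {{MSBRadixSorter}} is not a comparison-based sort, so it implements {{compare()}} only to throw, and that method should only ever be reachable through an inherited comparison-based fallback path. A stripped-down sketch of that pattern — class names are illustrative, not the Lucene source, and the fallback here is a plain insertion sort standing in for the real heap sort:

```java
// Sketch of a sorter hierarchy where the radix subclass deliberately
// does not support comparisons; illustrative names, not Lucene's code.
abstract class BaseSorter {
  abstract int compare(int i, int j);

  abstract void swap(int i, int j);

  // Comparison-based fallback for small ranges. If it ever runs on a
  // non-comparison sorter, the stack trace blames compare() even though
  // the subclass never intended that method to be reachable.
  void fallbackSort(int from, int to) {
    for (int i = from + 1; i < to; i++) {
      for (int j = i; j > from && compare(j - 1, j) > 0; j--) {
        swap(j - 1, j);
      }
    }
  }
}

class RadixishSorter extends BaseSorter {
  final int[] data;

  RadixishSorter(int[] data) {
    this.data = data;
  }

  @Override
  int compare(int i, int j) {
    throw new UnsupportedOperationException("unused: not a comparison-based sort");
  }

  @Override
  void swap(int i, int j) {
    int tmp = data[i];
    data[i] = data[j];
    data[j] = tmp;
  }
}
```

So for the reported frames to be genuine, {{compare()}} must have been reached through such a fallback; if the cited line truly cannot call {{Sorter.heapify}}, a misattributed or miscompiled frame (a JVM/JIT bug) is a plausible suspect, which is what the report concludes.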
[jira] [Commented] (LUCENE-8791) Add CollectorRescorer
[ https://issues.apache.org/jira/browse/LUCENE-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852068#comment-16852068 ] Michael McCandless commented on LUCENE-8791: {quote}In my opinion, rescorers should be used on the very top hits only. {quote} Is the concern that this rescorer optionally accepts {{ExecutorService}} to distribute the work across threads? Maybe we could just add another ctor *not* taking {{ExecutorService}} for those use cases that want to run single-threaded? > Add CollectorRescorer > - > > Key: LUCENE-8791 > URL: https://issues.apache.org/jira/browse/LUCENE-8791 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Elbek Kamoliddinov >Priority: Major > Attachments: LUCENE-8791.patch, LUCENE-8791.patch, LUCENE-8791.patch > > > This is another implementation of query rescorer api (LUCENE-5489). It adds > rescoring functionality based on provided CollectorManager. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
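The suggestion above is the standard overloaded-constructor pattern: keep the {{ExecutorService}} constructor for concurrent rescoring and add a convenience constructor that runs on the calling thread. A hedged sketch — class and method names here are hypothetical, not the patch's actual API:

```java
import java.util.concurrent.ExecutorService;

// Hypothetical rescorer skeleton showing the proposed constructor pair;
// a null executor means "run single-threaded on the calling thread".
class CollectorRescorerSketch {
  private final ExecutorService executor; // null => single-threaded

  CollectorRescorerSketch() {
    this(null); // convenience ctor for single-threaded use cases
  }

  CollectorRescorerSketch(ExecutorService executor) {
    this.executor = executor;
  }

  boolean isConcurrent() {
    return executor != null;
  }
}
```

This keeps the threaded path opt-in without forcing single-threaded callers to construct an executor they never use.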
[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844005#comment-16844005 ] Michael McCandless commented on LUCENE-8757: {quote}Your last patch sorts in reverse order of docBase, it should sort by the natural order? {quote} Hmm can we add a test case or an assertion somewhere that would fail if this happens again in the future? > Better Segment To Thread Mapping Algorithm > -- > > Key: LUCENE-8757 > URL: https://issues.apache.org/jira/browse/LUCENE-8757 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Atri Sharma >Assignee: Simon Willnauer >Priority: Major > Attachments: LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch, > LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch > > > The current segments to threads allocation algorithm always allocates one > thread per segment. This is detrimental to performance in case of skew in > segment sizes since small segments also get their dedicated thread. This can > lead to performance degradation due to context switching overheads. > > A better algorithm which is cognizant of size skew would have better > performance for realistic scenarios -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8804) FieldType attribute map should not be modifiable after freeze
[ https://issues.apache.org/jira/browse/LUCENE-8804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844003#comment-16844003 ] Michael McCandless commented on LUCENE-8804: +1 to the issue and patch, great catch, thanks [~vamshi]! > FieldType attribute map should not be modifiable after freeze > - > > Key: LUCENE-8804 > URL: https://issues.apache.org/jira/browse/LUCENE-8804 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 8.0 >Reporter: Vamshi Vijay Nakkirtha >Priority: Minor > Labels: features, patch > Attachments: LUCENE-8804.patch > > > Today the FieldType attribute map is modifiable even after freeze. For all > other properties of FieldType, we do "checkIfFrozen()" before making the > update to the property, but for the attribute map we do not seem to make such > a check. > > [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.0.0/lucene/core/src/java/org/apache/lucene/document/FieldType.java#L363] > We may need to add a check at the beginning of the function, similar to the > other property setters. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
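The fix being endorsed here is the usual {{checkIfFrozen()}} guard, extended to cover the attribute map, plus handing out a read-only view so callers cannot bypass the guard. A minimal sketch of the invariant in plain Java — not the actual {{FieldType}} source:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of a freezable type whose attribute map stops accepting
// mutations after freeze(), modeled on the guard FieldType already
// uses for its other properties.
class FreezableAttrs {
  private final Map<String, String> attributes = new HashMap<>();
  private boolean frozen;

  void freeze() {
    frozen = true;
  }

  private void checkIfFrozen() {
    if (frozen) {
      throw new IllegalStateException("this FieldType is already frozen and cannot be changed");
    }
  }

  String putAttribute(String key, String value) {
    checkIfFrozen(); // the missing guard the patch adds
    return attributes.put(key, value);
  }

  Map<String, String> getAttributes() {
    // Read-only view, so the frozen state cannot be bypassed via the getter.
    return Collections.unmodifiableMap(attributes);
  }
}
```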
[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835502#comment-16835502 ] Michael McCandless commented on LUCENE-8757: Are the work units tackled in order for each query? I.e. is the queue a FIFO queue? If so, the sorting can be useful since {{IndexSearcher}} would work first on the hardest/slowest work units, the "long poles" for the concurrent search? > Better Segment To Thread Mapping Algorithm > -- > > Key: LUCENE-8757 > URL: https://issues.apache.org/jira/browse/LUCENE-8757 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Atri Sharma >Priority: Major > Attachments: LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch > > > The current segments to threads allocation algorithm always allocates one > thread per segment. This is detrimental to performance in case of skew in > segment sizes since small segments also get their dedicated thread. This can > lead to performance degradation due to context switching overheads. > > A better algorithm which is cognizant of size skew would have better > performance for realistic scenarios -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
[ https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835500#comment-16835500 ] Michael McCandless commented on LUCENE-8785: Thank you [~simonw]! Love how open-source works ;) Lucene gets better. > TestIndexWriterDelete.testDeleteAllNoDeadlock failure > - > > Key: LUCENE-8785 > URL: https://issues.apache.org/jira/browse/LUCENE-8785 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.6 > Environment: OpenJDK 1.8.0_202 >Reporter: Michael McCandless >Assignee: Simon Willnauer >Priority: Minor > Fix For: 7.7.2, master (9.0), 8.2, 8.1.1 > > Time Spent: 40m > Remaining Estimate: 0h > > I was running Lucene's core tests on an {{i3.16xlarge}} EC2 instance (64 > cores), and hit this random yet spooky failure: > {noformat} > [junit4] 2> NOTE: reproduce with: ant test > -Dtestcase=TestIndexWriterDelete -Dtests.method=testDeleteAllNoDeadLock > -Dtests.seed=952BE262BA547C1 -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=ar-YE -Dtests.timezone=Europe/Lisbon -Dtests.as\ > serts=true -Dtests.file.encoding=US-ASCII > [junit4] ERROR 0.16s J3 | TestIndexWriterDelete.testDeleteAllNoDeadLock > <<< > [junit4] > Throwable #1: > com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an > uncaught exception in thread: Thread[id=36, name=Thread-2, state=RUNNABLE, > group=TGRP-TestIndexWriterDelete] > [junit4] > at > __randomizedtesting.SeedInfo.seed([952BE262BA547C1:3A4B5138AB66FD97]:0) > [junit4] > Caused by: java.lang.RuntimeException: > java.lang.IllegalArgumentException: field number 0 is already mapped to field > name "null", not "content" > [junit4] > at > __randomizedtesting.SeedInfo.seed([952BE262BA547C1]:0) > [junit4] > at > org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:332) > [junit4] > Caused by: java.lang.IllegalArgumentException: field number > 0 is already mapped to field name "null", not "content" > 
[junit4] > at > org.apache.lucene.index.FieldInfos$FieldNumbers.verifyConsistent(FieldInfos.java:310) > [junit4] > at > org.apache.lucene.index.FieldInfos$Builder.getOrAdd(FieldInfos.java:415) > [junit4] > at > org.apache.lucene.index.DefaultIndexingChain.getOrAddField(DefaultIndexingChain.java:650) > [junit4] > at > org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:428) > [junit4] > at > org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394) > [junit4] > at > org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297) > [junit4] > at > org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450) > [junit4] > at > org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291) > [junit4] > at > org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264) > [junit4] > at > org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:159) > [junit4] > at > org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:326){noformat} > It does *not* reproduce unfortunately ... but maybe there is some subtle > thread safety issue in this code ... this is a hairy part of Lucene ;) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835498#comment-16835498 ] Michael McCandless commented on LUCENE-8757: Whoa, fast iterations over here! I think there is an important justification for the second criterion (the number of segments in each work unit / slice): if you have an index with some large segments plus a long tail of small segments (which easily happens if your machine has substantial CPU concurrency and you use multiple indexing threads), then, since there is a fixed cost for visiting each segment, putting too many small segments into one work unit multiplies those fixed costs, and that one work unit can become too slow even though it does not actually visit very many documents. I think we should keep it? Re: the choice of the constants – I ran some performance tests quite a while ago on our production data/queries and a machine with sizable concurrency ({{i3.16xlarge}}) and found those two constants to be a sweet spot at the time. But let's also remember: this is simply a default segment -> work units assignment, and expert users can always continue to override. Good defaults are important ;) > Better Segment To Thread Mapping Algorithm > -- > > Key: LUCENE-8757 > URL: https://issues.apache.org/jira/browse/LUCENE-8757 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Atri Sharma >Priority: Major > Attachments: LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch > > > The current segments to threads allocation algorithm always allocates one > thread per segment. This is detrimental to performance in case of skew in > segment sizes since small segments also get their dedicated thread. This can > lead to performance degradation due to context switching overheads. 
> > A better algorithm which is cognizant of size skew would have better > performance for realistic scenarios -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
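For readers following along, the allocation under discussion can be sketched as a greedy packing: sort segments by size descending, give big segments their own slice, and let small segments share a slice until either a max-docs or max-segments cap is hit — the second cap bounding the per-segment fixed costs described above. A self-contained sketch, with illustrative caps rather than the patch's actual defaults:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Greedy segment -> slice packing: big segments get their own slice,
// small segments share one until either cap is hit. The caps are
// illustrative only; the patch chooses its own constants.
final class SlicePacker {
  static List<List<Integer>> slices(
      int[] segmentDocCounts, int maxDocsPerSlice, int maxSegmentsPerSlice) {
    Integer[] sorted = new Integer[segmentDocCounts.length];
    for (int i = 0; i < sorted.length; i++) {
      sorted[i] = segmentDocCounts[i];
    }
    Arrays.sort(sorted, (a, b) -> b - a); // largest segments first

    List<List<Integer>> result = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    int docsInCurrent = 0;
    for (int docCount : sorted) {
      // Close the current slice when adding this segment would exceed
      // either the doc cap or the segment-count cap.
      if (!current.isEmpty()
          && (docsInCurrent + docCount > maxDocsPerSlice
              || current.size() >= maxSegmentsPerSlice)) {
        result.add(current);
        current = new ArrayList<>();
        docsInCurrent = 0;
      }
      current.add(docCount);
      docsInCurrent += docCount;
    }
    if (!current.isEmpty()) {
      result.add(current);
    }
    return result;
  }
}
```

Sorting largest-first also pairs naturally with a FIFO work queue, as discussed earlier in the thread: the slowest slices start executing first.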
[jira] [Commented] (LUCENE-8791) Add CollectorRescorer
[ https://issues.apache.org/jira/browse/LUCENE-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832327#comment-16832327 ] Michael McCandless commented on LUCENE-8791: {quote}Please have a look again at the spacing. In general, it would be good if the code was a bit more readable w.r.t spacing around braces, breaking the code into logical paragraphes. {quote} The spacing/indenting looks correct to me – it seems to match Lucene's coding guidelines ([https://wiki.apache.org/lucene-java/DeveloperTips]). > Add CollectorRescorer > - > > Key: LUCENE-8791 > URL: https://issues.apache.org/jira/browse/LUCENE-8791 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Elbek Kamoliddinov >Priority: Major > Attachments: LUCENE-8791.patch > > > This is another implementation of query rescorer api (LUCENE-5489). It adds > rescoring functionality based on provided CollectorManager. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8756) MLT queries ignore custom term frequencies
[ https://issues.apache.org/jira/browse/LUCENE-8756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832139#comment-16832139 ] Michael McCandless commented on LUCENE-8756: Ugh, sorry! Thank you [~cpoerschke]! > MLT queries ignore custom term frequencies > -- > > Key: LUCENE-8756 > URL: https://issues.apache.org/jira/browse/LUCENE-8756 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Affects Versions: 7.0, 7.0.1, 7.1, 7.2, 7.2.1, 7.3, 7.4, 7.3.1, 7.5, 7.6, > 7.7, 7.7.1, 8.0 >Reporter: Olli Kuonanoja >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > The MLT queries ignore any custom term frequencies for the like-texts and > use a hard-coded frequency of 1 per occurrence. I have prepared a test case > to demonstrate the issue and a fix proposal > https://github.com/ollik1/lucene-solr/commit/9dbbce2af26698cec1ac82a526d9cee60a880678 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8790) Spooky exception merging doc values
Michael McCandless created LUCENE-8790: -- Summary: Spooky exception merging doc values Key: LUCENE-8790 URL: https://issues.apache.org/jira/browse/LUCENE-8790 Project: Lucene - Core Issue Type: Bug Components: core/index Affects Versions: 7.5 Environment: We are on a Lucene 7.x snapshot, githash 935b0c89c6ecb446d7f05d938207760cd64bcd04, using the default Codec, with a static sort. Reporter: Michael McCandless We hit this exciting exception; we don't have a test case reproducing it, and staring at the code, I don't see how we can hit a {{NullPointerException}} on this line: {noformat} [May 2, 2019, 7:24 PM] Barrowman, Adam: 2019-05-02T18:32:10,561 [ERROR] (Lucene Merge Thread #1) com.amazon.lucene.util.UncaughtExceptionHandler: Uncaught exception: org.apache.lucene.index.MergePolicy$MergeException: java.lang.NullPointerException in thread Thread[Lucene Merge Thread #1,5,main] org.apache.lucene.index.MergePolicy$MergeException: java.lang.NullPointerException at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:704) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684) Caused by: java.lang.NullPointerException at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.writeValuesSingleBlock(Lucene70DocValuesConsumer.java:279) at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.writeValues(Lucene70DocValuesConsumer.java:263) at org.apache.lucene.codecs.lucene70.Lucene70DocValuesConsumer.addSortedNumericField(Lucene70DocValuesConsumer.java:536) at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:371) at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:143) at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:151) at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:182) at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:126) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4438) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4060) at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) at com.amazon.lucene.index.ConcurrentMergeSchedulerWrapper.doMerge(ConcurrentMergeSchedulerWrapper.java:54) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) {noformat} It seems like the {{encode.get(v)}} somehow returned null, which should not happen as long as the values we iterated from the {{SortedNumericValues}} were the same up above (in {{writeValues}}) and in {{writeValuesSingleBlock}}. Confused... Note that we are using a 7.x snapshot, so it is possible this was a bug in 7.x at that time, fixed before the next 7.x release; though when I compare the affected code against the 8.x backwards codec, it looks the same. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
[ https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832036#comment-16832036 ] Michael McCandless commented on LUCENE-8785: {quote}If there is another thread coming in after we locked the existent threadstates we just issue a new one. {quote} Yuck :( {quote}I think we can just do what deleteAll() does today except of not dropping the schema on the floor? {quote} The thing is, I think erasing the schema while under a transaction is a useful feature of Lucene. I realize neither ES nor Solr exposes deleteAll, but I don't think that's a valid argument for removing it from Lucene ;) {quote}I want to understand the usecase for this. I can see how somebody wants to drop all docs but basically dropping all IW state on the floor is difficult in my eyes. {quote} Well, imagine a user searching documents with diverse/varying fields, maybe arriving from an external (not controlled by the developer) source. And for some reason the index is reset once per week, but the devs want to allow searching of the old index while the new index is (slowly) built up. But if something goes badly wrong, they need to be able to roll back (the {{deleteAll}} and all subsequently added docs) to the last commit and try again later. If instead it succeeds, then a refresh/commit will switch to the new index atomically. 
> TestIndexWriterDelete.testDeleteAllNoDeadlock failure > - > > Key: LUCENE-8785 > URL: https://issues.apache.org/jira/browse/LUCENE-8785 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.6 > Environment: OpenJDK 1.8.0_202 >Reporter: Michael McCandless >Priority: Minor > > I was running Lucene's core tests on an {{i3.16xlarge}} EC2 instance (64 > cores), and hit this random yet spooky failure: > {noformat} > [junit4] 2> NOTE: reproduce with: ant test > -Dtestcase=TestIndexWriterDelete -Dtests.method=testDeleteAllNoDeadLock > -Dtests.seed=952BE262BA547C1 -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=ar-YE -Dtests.timezone=Europe/Lisbon -Dtests.as\ > serts=true -Dtests.file.encoding=US-ASCII > [junit4] ERROR 0.16s J3 | TestIndexWriterDelete.testDeleteAllNoDeadLock > <<< > [junit4] > Throwable #1: > com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an > uncaught exception in thread: Thread[id=36, name=Thread-2, state=RUNNABLE, > group=TGRP-TestIndexWriterDelete] > [junit4] > at > __randomizedtesting.SeedInfo.seed([952BE262BA547C1:3A4B5138AB66FD97]:0) > [junit4] > Caused by: java.lang.RuntimeException: > java.lang.IllegalArgumentException: field number 0 is already mapped to field > name "null", not "content" > [junit4] > at > __randomizedtesting.SeedInfo.seed([952BE262BA547C1]:0) > [junit4] > at > org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:332) > [junit4] > Caused by: java.lang.IllegalArgumentException: field number > 0 is already mapped to field name "null", not "content" > [junit4] > at > org.apache.lucene.index.FieldInfos$FieldNumbers.verifyConsistent(FieldInfos.java:310) > [junit4] > at > org.apache.lucene.index.FieldInfos$Builder.getOrAdd(FieldInfos.java:415) > [junit4] > at > org.apache.lucene.index.DefaultIndexingChain.getOrAddField(DefaultIndexingChain.java:650) > [junit4] > at > 
org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:428) > [junit4] > at > org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394) > [junit4] > at > org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297) > [junit4] > at > org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450) > [junit4] > at > org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291) > [junit4] > at > org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264) > [junit4] > at > org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:159) > [junit4] > at > org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:326){noformat} > It does *not* reproduce unfortunately ... but maybe there is some subtle > thread safety issue in this code ... this is a hairy part of Lucene ;) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6687) MLT term frequency calculation bug
[ https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832027#comment-16832027 ] Michael McCandless commented on LUCENE-6687: OK thanks [~teofili] – I'll backport this soon. > MLT term frequency calculation bug > -- > > Key: LUCENE-6687 > URL: https://issues.apache.org/jira/browse/LUCENE-6687 > Project: Lucene - Core > Issue Type: Bug > Components: core/query/scoring, core/queryparser >Affects Versions: 5.2.1, 6.0 > Environment: OS X v10.10.4; Solr 5.2.1 >Reporter: Marko Bonaci >Assignee: Tommaso Teofili >Priority: Major > Fix For: 5.2.2, master (9.0) > > Attachments: LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch, > LUCENE-6687.patch, buggy-method-usage.png, > solr-mlt-tf-doubling-bug-results.png, > solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png, > solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, > solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png, > terms-glass.png, terms-how.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > In {{org.apache.lucene.queries.mlt.MoreLikeThis}}, there's a method > {{retrieveTerms}} that receives a {{Map}} of fields, i.e. a document > basically, but it doesn't have to be an existing doc. > !solr-mlt-tf-doubling-bug.png|height=500! > There are 2 for loops, one inside the other, which both loop through the same > set of fields. > That effectively doubles the term frequency for all the terms from fields > that we provide in MLT QP {{qf}} parameter. > It basically goes two times over the list of fields and accumulates the term > frequencies from all fields into {{termFreqMap}}. > The private method {{retrieveTerms}} is only called from one public method, > the version of overloaded method {{like}} that receives a Map: so that > private class member {{fieldNames}} is always derived from > {{retrieveTerms}}'s argument {{fields}}. 
> > Uh, I don't understand what I wrote myself, but that basically means that, by > the time {{retrieveTerms}} method gets called, its parameter fields and > private member {{fieldNames}} always contain the same list of fields. > Here's the proof: > These are the final results of the calculation: > !solr-mlt-tf-doubling-bug-results.png|height=700! > And this is the actual {{thread_id:TID0009}} document, where those values > were derived from (from fields {{title_mlt}} and {{pagetext_mlt}}): > !terms-glass.png|height=100! > !terms-angry.png|height=100! > !terms-how.png|height=100! > !terms-accumulator.png|height=100! > Now, let's further test this hypothesis by seeing MLT QP in action from the > AdminUI. > Let's try to find docs that are More Like doc {{TID0009}}. > Here's the interesting part, the query: > {code} > q={!mlt qf=pagetext_mlt,title_mlt mintf=14 mindf=2 minwl=3 maxwl=15}TID0009 > {code} > We just saw, in the last image above, that the term accumulator appears {{7}} > times in {{TID0009}} doc, but the {{accumulator}}'s TF was calculated as > {{14}}. > By using {{mintf=14}}, we say that, when calculating similarity, we don't > want to consider terms that appear less than 14 times (when terms from fields > {{title_mlt}} and {{pagetext_mlt}} are merged together) in {{TID0009}}. > I added the term accumulator in only one other document ({{TID0004}}), where > it appears only once, in the field {{title_mlt}}. > !solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png|height=500! > Let's see what happens when we use {{mintf=15}}: > !solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png|height=500! > I should probably mention that multiple fields ({{qf}}) work because I > applied the patch: > [SOLR-7143|https://issues.apache.org/jira/browse/SOLR-7143]. > Bug, no? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
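The doubling described above reduces to an accumulation loop nested inside a second loop over the same field list, so with two {{qf}} fields every term's frequency is counted twice. A self-contained demonstration — hypothetical names, not the {{MoreLikeThis}} source — reproducing the reported 7 vs. 14 numbers:

```java
import java.util.HashMap;
import java.util.Map;

// Demonstrates how looping over the same field list twice multiplies
// every accumulated term frequency by the number of fields; a sketch
// of the reported bug, not the MoreLikeThis code itself.
final class TfAccumulator {
  // Buggy shape: outer loop over field names, inner loop re-visits
  // every field's terms on each outer iteration.
  static Map<String, Integer> buggyRetrieveTerms(Map<String, Map<String, Integer>> fields) {
    Map<String, Integer> termFreqMap = new HashMap<>();
    for (String ignored : fields.keySet()) { // outer loop over fields
      for (Map<String, Integer> terms : fields.values()) { // inner loop over the SAME fields
        for (Map.Entry<String, Integer> e : terms.entrySet()) {
          termFreqMap.merge(e.getKey(), e.getValue(), Integer::sum);
        }
      }
    }
    return termFreqMap;
  }

  // Fixed shape: a single pass over the fields.
  static Map<String, Integer> fixedRetrieveTerms(Map<String, Map<String, Integer>> fields) {
    Map<String, Integer> termFreqMap = new HashMap<>();
    for (Map<String, Integer> terms : fields.values()) {
      for (Map.Entry<String, Integer> e : terms.entrySet()) {
        termFreqMap.merge(e.getKey(), e.getValue(), Integer::sum);
      }
    }
    return termFreqMap;
  }
}
```

With two fields contributing frequencies 3 and 4 for the same term, the buggy path accumulates 14 while the single pass yields 7 — matching the accumulator example in the report.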
[jira] [Commented] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
[ https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831618#comment-16831618 ] Michael McCandless commented on LUCENE-8785: But at the point we call {{clear()}} haven't we already blocked all indexing threads? I also dislike {{deleteAll()}} and you're right a user could deleteByQuery using MatchAllDocsQuery; can we make that close-ish to as efficient as {{deleteAll()}} is today? Though indeed that would preserve the schema, while {{deleteAll()}} lets you delete docs and delete the schema, all under a transaction (the change is not visible until commit). I'm torn on just removing that ... > TestIndexWriterDelete.testDeleteAllNoDeadlock failure > - > > Key: LUCENE-8785 > URL: https://issues.apache.org/jira/browse/LUCENE-8785 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.6 > Environment: OpenJDK 1.8.0_202 >Reporter: Michael McCandless >Priority: Minor > > I was running Lucene's core tests on an {{i3.16xlarge}} EC2 instance (64 > cores), and hit this random yet spooky failure: > {noformat} > [junit4] 2> NOTE: reproduce with: ant test > -Dtestcase=TestIndexWriterDelete -Dtests.method=testDeleteAllNoDeadLock > -Dtests.seed=952BE262BA547C1 -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=ar-YE -Dtests.timezone=Europe/Lisbon -Dtests.as\ > serts=true -Dtests.file.encoding=US-ASCII > [junit4] ERROR 0.16s J3 | TestIndexWriterDelete.testDeleteAllNoDeadLock > <<< > [junit4] > Throwable #1: > com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an > uncaught exception in thread: Thread[id=36, name=Thread-2, state=RUNNABLE, > group=TGRP-TestIndexWriterDelete] > [junit4] > at > __randomizedtesting.SeedInfo.seed([952BE262BA547C1:3A4B5138AB66FD97]:0) > [junit4] > Caused by: java.lang.RuntimeException: > java.lang.IllegalArgumentException: field number 0 is already mapped to field > name "null", not "content" > [junit4] > at > 
__randomizedtesting.SeedInfo.seed([952BE262BA547C1]:0) > [junit4] > at > org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:332) > [junit4] > Caused by: java.lang.IllegalArgumentException: field number > 0 is already mapped to field name "null", not "content" > [junit4] > at > org.apache.lucene.index.FieldInfos$FieldNumbers.verifyConsistent(FieldInfos.java:310) > [junit4] > at > org.apache.lucene.index.FieldInfos$Builder.getOrAdd(FieldInfos.java:415) > [junit4] > at > org.apache.lucene.index.DefaultIndexingChain.getOrAddField(DefaultIndexingChain.java:650) > [junit4] > at > org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:428) > [junit4] > at > org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394) > [junit4] > at > org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297) > [junit4] > at > org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450) > [junit4] > at > org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291) > [junit4] > at > org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264) > [junit4] > at > org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:159) > [junit4] > at > org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:326){noformat} > It does *not* reproduce unfortunately ... but maybe there is some subtle > thread safety issue in this code ... this is a hairy part of Lucene ;) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
[ https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-8785: --- Environment: OpenJDK 1.8.0_202 (was: OpenJDK 11) > TestIndexWriterDelete.testDeleteAllNoDeadlock failure > - > > Key: LUCENE-8785 > URL: https://issues.apache.org/jira/browse/LUCENE-8785 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.6 > Environment: OpenJDK 1.8.0_202 >Reporter: Michael McCandless >Priority: Minor > > I was running Lucene's core tests on an {{i3.16xlarge}} EC2 instance (64 > cores), and hit this random yet spooky failure: > {noformat} > [junit4] 2> NOTE: reproduce with: ant test > -Dtestcase=TestIndexWriterDelete -Dtests.method=testDeleteAllNoDeadLock > -Dtests.seed=952BE262BA547C1 -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=ar-YE -Dtests.timezone=Europe/Lisbon -Dtests.as\ > serts=true -Dtests.file.encoding=US-ASCII > [junit4] ERROR 0.16s J3 | TestIndexWriterDelete.testDeleteAllNoDeadLock > <<< > [junit4] > Throwable #1: > com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an > uncaught exception in thread: Thread[id=36, name=Thread-2, state=RUNNABLE, > group=TGRP-TestIndexWriterDelete] > [junit4] > at > __randomizedtesting.SeedInfo.seed([952BE262BA547C1:3A4B5138AB66FD97]:0) > [junit4] > Caused by: java.lang.RuntimeException: > java.lang.IllegalArgumentException: field number 0 is already mapped to field > name "null", not "content" > [junit4] > at > __randomizedtesting.SeedInfo.seed([952BE262BA547C1]:0) > [junit4] > at > org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:332) > [junit4] > Caused by: java.lang.IllegalArgumentException: field number > 0 is already mapped to field name "null", not "content" > [junit4] > at > org.apache.lucene.index.FieldInfos$FieldNumbers.verifyConsistent(FieldInfos.java:310) > [junit4] > at > 
org.apache.lucene.index.FieldInfos$Builder.getOrAdd(FieldInfos.java:415) > [junit4] > at > org.apache.lucene.index.DefaultIndexingChain.getOrAddField(DefaultIndexingChain.java:650) > [junit4] > at > org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:428) > [junit4] > at > org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394) > [junit4] > at > org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297) > [junit4] > at > org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450) > [junit4] > at > org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291) > [junit4] > at > org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264) > [junit4] > at > org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:159) > [junit4] > at > org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:326){noformat} > It does *not* reproduce unfortunately ... but maybe there is some subtle > thread safety issue in this code ... this is a hairy part of Lucene ;) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
Michael McCandless created LUCENE-8785: -- Summary: TestIndexWriterDelete.testDeleteAllNoDeadlock failure Key: LUCENE-8785 URL: https://issues.apache.org/jira/browse/LUCENE-8785 Project: Lucene - Core Issue Type: Bug Components: core/index Affects Versions: 7.6 Environment: OpenJDK 11 Reporter: Michael McCandless I was running Lucene's core tests on an {{i3.16xlarge}} EC2 instance (64 cores), and hit this random yet spooky failure: {noformat} [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterDelete -Dtests.method=testDeleteAllNoDeadLock -Dtests.seed=952BE262BA547C1 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=ar-YE -Dtests.timezone=Europe/Lisbon -Dtests.as\ serts=true -Dtests.file.encoding=US-ASCII [junit4] ERROR 0.16s J3 | TestIndexWriterDelete.testDeleteAllNoDeadLock <<< [junit4] > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=36, name=Thread-2, state=RUNNABLE, group=TGRP-TestIndexWriterDelete] [junit4] > at __randomizedtesting.SeedInfo.seed([952BE262BA547C1:3A4B5138AB66FD97]:0) [junit4] > Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: field number 0 is already mapped to field name "null", not "content" [junit4] > at __randomizedtesting.SeedInfo.seed([952BE262BA547C1]:0) [junit4] > at org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:332) [junit4] > Caused by: java.lang.IllegalArgumentException: field number 0 is already mapped to field name "null", not "content" [junit4] > at org.apache.lucene.index.FieldInfos$FieldNumbers.verifyConsistent(FieldInfos.java:310) [junit4] > at org.apache.lucene.index.FieldInfos$Builder.getOrAdd(FieldInfos.java:415) [junit4] > at org.apache.lucene.index.DefaultIndexingChain.getOrAddField(DefaultIndexingChain.java:650) [junit4] > at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:428) [junit4] > at 
org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394) [junit4] > at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297) [junit4] > at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450) [junit4] > at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291) [junit4] > at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264) [junit4] > at org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:159) [junit4] > at org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:326){noformat} It does *not* reproduce unfortunately ... but maybe there is some subtle thread safety issue in this code ... this is a hairy part of Lucene ;) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6687) MLT term frequency calculation bug
[ https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830448#comment-16830448 ] Michael McCandless commented on LUCENE-6687: Hmm it looks like this change was not backported to 8.x – was that intentional? I'm having trouble backporting LUCENE-8756 because of this ... if it was unintentional, I'll just backport this change first. Why do we show Fix Version 5.2.2? Was it really backported to 5.2.x branch? > MLT term frequency calculation bug > -- > > Key: LUCENE-6687 > URL: https://issues.apache.org/jira/browse/LUCENE-6687 > Project: Lucene - Core > Issue Type: Bug > Components: core/query/scoring, core/queryparser >Affects Versions: 5.2.1, 6.0 > Environment: OS X v10.10.4; Solr 5.2.1 >Reporter: Marko Bonaci >Assignee: Tommaso Teofili >Priority: Major > Fix For: 5.2.2, master (9.0) > > Attachments: LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch, > LUCENE-6687.patch, buggy-method-usage.png, > solr-mlt-tf-doubling-bug-results.png, > solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png, > solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, > solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png, > terms-glass.png, terms-how.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > In {{org.apache.lucene.queries.mlt.MoreLikeThis}}, there's a method > {{retrieveTerms}} that receives a {{Map}} of fields, i.e. a document > basically, but it doesn't have to be an existing doc. > !solr-mlt-tf-doubling-bug.png|height=500! > There are 2 for loops, one inside the other, which both loop through the same > set of fields. > That effectively doubles the term frequency for all the terms from fields > that we provide in MLT QP {{qf}} parameter. > It basically goes two times over the list of fields and accumulates the term > frequencies from all fields into {{termFreqMap}}. 
> The private method {{retrieveTerms}} is only called from one public method, > the version of overloaded method {{like}} that receives a Map: so that > private class member {{fieldNames}} is always derived from > {{retrieveTerms}}'s argument {{fields}}. > > Uh, I don't understand what I wrote myself, but that basically means that, by > the time {{retrieveTerms}} method gets called, its parameter fields and > private member {{fieldNames}} always contain the same list of fields. > Here's the proof: > These are the final results of the calculation: > !solr-mlt-tf-doubling-bug-results.png|height=700! > And this is the actual {{thread_id:TID0009}} document, where those values > were derived from (from fields {{title_mlt}} and {{pagetext_mlt}}): > !terms-glass.png|height=100! > !terms-angry.png|height=100! > !terms-how.png|height=100! > !terms-accumulator.png|height=100! > Now, let's further test this hypothesis by seeing MLT QP in action from the > AdminUI. > Let's try to find docs that are More Like doc {{TID0009}}. > Here's the interesting part, the query: > {code} > q={!mlt qf=pagetext_mlt,title_mlt mintf=14 mindf=2 minwl=3 maxwl=15}TID0009 > {code} > We just saw, in the last image above, that the term accumulator appears {{7}} > times in {{TID0009}} doc, but the {{accumulator}}'s TF was calculated as > {{14}}. > By using {{mintf=14}}, we say that, when calculating similarity, we don't > want to consider terms that appear less than 14 times (when terms from fields > {{title_mlt}} and {{pagetext_mlt}} are merged together) in {{TID0009}}. > I added the term accumulator in only one other document ({{TID0004}}), where > it appears only once, in the field {{title_mlt}}. > !solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png|height=500! > Let's see what happens when we use {{mintf=15}}: > !solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png|height=500! 
> I should probably mention that multiple fields ({{qf}}) work because I > applied the patch: > [SOLR-7143|https://issues.apache.org/jira/browse/SOLR-7143]. > Bug, no?
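The double counting reported above can be reduced to a small standalone sketch (plain Python with made-up names, not the actual {{MoreLikeThis}} code): an outer loop over the field list wrapped around an inner loop over the same field list multiplies every accumulated term frequency by the number of fields — with the two fields in {{qf}}, the 7 occurrences of "accumulator" become 14, exactly as in the screenshots.

```python
def accumulate_term_freqs(fields, doc, nested_bug=False):
    """Accumulate per-term frequencies across the given fields of a doc
    (a dict of field -> {term: freq}).  With nested_bug=True, mimic the
    reported bug: the field list is traversed once per field, so every
    count is multiplied by len(fields)."""
    term_freqs = {}
    passes = fields if nested_bug else [None]  # buggy version repeats the scan per field
    for _ in passes:
        for field in fields:
            for term, freq in doc.get(field, {}).items():
                term_freqs[term] = term_freqs.get(term, 0) + freq
    return term_freqs
```

With two fields the buggy path reproduces the 7-vs-14 discrepancy from the report.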
[jira] [Commented] (LUCENE-8756) MLT queries ignore custom term frequencies
[ https://issues.apache.org/jira/browse/LUCENE-8756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830206#comment-16830206 ] Michael McCandless commented on LUCENE-8756: Great, thanks [~ollik1] – I'll push soon. > MLT queries ignore custom term frequencies > -- > > Key: LUCENE-8756 > URL: https://issues.apache.org/jira/browse/LUCENE-8756 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Affects Versions: 7.0, 7.0.1, 7.1, 7.2, 7.2.1, 7.3, 7.4, 7.3.1, 7.5, 7.6, > 7.7, 7.7.1, 8.0 >Reporter: Olli Kuonanoja >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > The MLT queries ignore any custom term frequencies for the like-texts and > uses a hard-coded frequency of 1 per occurrence. I have prepared a test-case > to demonstrate the issue and a fix proposal > https://github.com/ollik1/lucene-solr/commit/9dbbce2af26698cec1ac82a526d9cee60a880678 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
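The quoted issue — every token occurrence counted as frequency 1, ignoring a custom per-token term frequency (Lucene's {{TermFrequencyAttribute}} idea) — can be sketched standalone; the names here are illustrative, not the MLT implementation:

```python
def count_terms(token_stream, use_custom_freq=True):
    """token_stream yields (term, freq) pairs, where freq is a custom
    per-occurrence term frequency.  With use_custom_freq=False, mimic
    the bug: every occurrence is counted as 1 regardless of freq."""
    counts = {}
    for term, freq in token_stream:
        counts[term] = counts.get(term, 0) + (freq if use_custom_freq else 1)
    return counts
```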
[jira] [Commented] (LUCENE-8708) Can we simplify conjunctions of range queries automatically?
[ https://issues.apache.org/jira/browse/LUCENE-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830184#comment-16830184 ] Michael McCandless commented on LUCENE-8708: Hmm why do we need the {{PointRangeQuery.ToStringInterface}}? Also, why did we need to comment on that one test case – {{testInvalidPointLength}}? > Can we simplify conjunctions of range queries automatically? > > > Key: LUCENE-8708 > URL: https://issues.apache.org/jira/browse/LUCENE-8708 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: interval_range_clauses_merging0704.patch > > > BooleanQuery#rewrite already has some logic to make queries more efficient, > such as deduplicating filters or rewriting boolean queries that wrap a single > positive clause to that clause. > It would be nice to also simplify conjunctions of range queries, so that eg. > {{foo: [5 TO *] AND foo:[* TO 20]}} would be rewritten to {{foo:[5 TO 20]}}. > When constructing queries manually or via the classic query parser, it feels > unnecessary as this is something that the user can fix easily. However if you > want to implement a query parser that only allows specifying one bound at > once, such as Gmail ({{after:2018-12-31}} > https://support.google.com/mail/answer/7190?hl=en) or GitHub > ({{updated:>=2018-12-31}} > https://help.github.com/en/articles/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated) > then you might end up with inefficient queries if the end user specifies > both an upper and a lower bound. It would be nice if we optimized those > automatically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
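The rewrite the issue asks for can be illustrated with a small standalone sketch (plain Python, not Lucene's actual {{BooleanQuery#rewrite}}): conjoining per-field range constraints, so {{foo:[5 TO *] AND foo:[* TO 20]}} collapses to {{foo:[5 TO 20]}}. {{None}} stands for an open bound ({{*}}).

```python
def merge_ranges(field_ranges):
    """Conjoin range constraints per field.  field_ranges is a list of
    (field, (lower, upper)) clauses; None means an open bound.
    Returns field -> (lower, upper), or None when the conjunction is
    unsatisfiable (disjoint ranges -> matches no documents)."""
    merged = {}
    for field, (lo, hi) in field_ranges:
        cur_lo, cur_hi = merged.get(field, (None, None))
        # intersect: take the tighter of each bound
        lo = cur_lo if lo is None else (lo if cur_lo is None else max(lo, cur_lo))
        hi = cur_hi if hi is None else (hi if cur_hi is None else min(hi, cur_hi))
        if lo is not None and hi is not None and lo > hi:
            return None  # e.g. foo:[30 TO *] AND foo:[* TO 20]
        merged[field] = (lo, hi)
    return merged
```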
[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830173#comment-16830173 ] Michael McCandless commented on LUCENE-8757: Thanks [~atris] – I agree it's important to have better defaults for how we coalesce segments into per-query-per-thread work units. A few small comments: * Can you insert {{_}} in the big number constants (e.g. {{2500}})? Makes it easier to read, and open-source code is written for reading :) * I think something is wrong with {{docSum}} – you only set it, and never add to it? I think the intention is to sum up docs in multiple adjacent (sorted by {{maxDoc}}) segments until that count exceeds {{2500}}? * How did you pick {{2500}} and {{100}} as good constants? We are using much smaller values in our production infrastructure – {{250_000}} and {{5}}, admittedly after only a little experimentation. * Can you add some tests? You can maybe make the slice method a package private static method and then create test cases with "interesting" {{LeafReaderContext}} combinations? In particular, a test case exposing the {{docSum}} bug would be great, then fix that bug, then see the test case pass. > Better Segment To Thread Mapping Algorithm > -- > > Key: LUCENE-8757 > URL: https://issues.apache.org/jira/browse/LUCENE-8757 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Atri Sharma >Priority: Major > Attachments: LUCENE-8757.patch > > > The current segments to threads allocation algorithm always allocates one > thread per segment. This is detrimental to performance in case of skew in > segment sizes since small segments also get their dedicated thread. This can > lead to performance degradation due to context switching overheads. 
> > A better algorithm which is cognizant of size skew would have better > performance for realistic scenarios
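The slicing idea discussed above — sort segments by size and coalesce adjacent small segments into one work unit until a document budget or segment cap is hit — can be sketched as follows. This is an illustrative standalone sketch, not the patch; the defaults reuse the 250_000 / 5 figures mentioned in the comment, and the running sum is accumulated (the {{docSum}} fix called out above), not merely assigned.

```python
def make_slices(segment_doc_counts, max_docs_per_slice=250_000,
                max_segments_per_slice=5):
    """Group segments (given as doc counts) into per-thread slices.
    Segments are sorted largest-first; adjacent segments are coalesced
    into one slice until the accumulated doc count or the per-slice
    segment cap would be exceeded."""
    sorted_counts = sorted(segment_doc_counts, reverse=True)
    slices, current, doc_sum = [], [], 0
    for count in sorted_counts:
        if current and (doc_sum + count > max_docs_per_slice
                        or len(current) >= max_segments_per_slice):
            slices.append(current)
            current, doc_sum = [], 0
        current.append(count)
        doc_sum += count  # accumulate across segments, don't overwrite
    if current:
        slices.append(current)
    return slices
```

A single segment larger than the budget still forms its own slice, and every segment lands in exactly one slice.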
[jira] [Commented] (LUCENE-8756) MLT queries ignore custom term frequencies
[ https://issues.apache.org/jira/browse/LUCENE-8756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830139#comment-16830139 ] Michael McCandless commented on LUCENE-8756: The change looks good – I left a couple minor comments – kinda freaky how Jira now tracks and posts how long I spend looking at a GitHub PR ;) Thanks [~ollik1]. > MLT queries ignore custom term frequencies > -- > > Key: LUCENE-8756 > URL: https://issues.apache.org/jira/browse/LUCENE-8756 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Affects Versions: 7.0, 7.0.1, 7.1, 7.2, 7.2.1, 7.3, 7.4, 7.3.1, 7.5, 7.6, > 7.7, 7.7.1, 8.0 >Reporter: Olli Kuonanoja >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The MLT queries ignore any custom term frequencies for the like-texts and > uses a hard-coded frequency of 1 per occurrence. I have prepared a test-case > to demonstrate the issue and a fix proposal > https://github.com/ollik1/lucene-solr/commit/9dbbce2af26698cec1ac82a526d9cee60a880678 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8783) Add FST Offheap for non-default Codecs
[ https://issues.apache.org/jira/browse/LUCENE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830070#comment-16830070 ] Michael McCandless commented on LUCENE-8783: +1 > Add FST Offheap for non-default Codecs > -- > > Key: LUCENE-8783 > URL: https://issues.apache.org/jira/browse/LUCENE-8783 > Project: Lucene - Core > Issue Type: New Feature > Components: core/FSTs >Reporter: Ankit Jain >Priority: Major > Fix For: 8.0, 8.x, master (9.0) > > > Even though, LUCENE-8635 and LUCENE-8671 adds support to keep FST offheap for > default codec, there are many other codecs which do not support FST offheap. > Few examples are below: > * CompletionPostingsFormat > * BlockTreeOrdsPostingsFormat > * IDVersionPostingsFormat -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose
[ https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829561#comment-16829561 ] Michael McCandless commented on LUCENE-8776: [~venkat11] I'm sorry this change broke your use case. I think allowing backwards offsets was an accidental but longstanding bug in prior versions of Lucene. It is unfortunate your code came to rely on that bug, but we need to be able to fix our bugs and move forwards. [~mgibney] a 3rd option in your list would be for [~venkat11] to fix his query parser to properly consume the graph, and generate fully accurate queries, the way Lucene's query parsers now do. Then you can have precisely matching queries, no bugs. > Start offset going backwards has a legitimate purpose > - > > Key: LUCENE-8776 > URL: https://issues.apache.org/jira/browse/LUCENE-8776 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 7.6 >Reporter: Ram Venkat >Priority: Major > > Here is the use case where startOffset can go backwards: > Say there is a line "Organic light-emitting-diode glows", and I want to run > span queries and highlight them properly. > During index time, light-emitting-diode is split into three words, which > allows me to search for 'light', 'emitting' and 'diode' individually. The > three words occupy adjacent positions in the index, as 'light' adjacent to > 'emitting' and 'light' at a distance of two words from 'diode' need to match > this word. So, the order of words after splitting are: Organic, light, > emitting, diode, glows. > But, I also want to search for 'organic' being adjacent to > 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two > positions: (a) In the same position as 'light' and (b) in the same position > as 'glows', like below: > ||organic||light||emitting||diode||glows|| > | |light-emitting-diode| |light-emitting-diode| | > |0|1|2|3|4| > The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets > are obviously the same. This works beautifully in Lucene 5.x in both > searching and highlighting with span queries. > But when I try this in Lucene 7.6, it hits the condition "Offsets must not go > backwards" at DefaultIndexingChain:818. This IllegalArgumentException is > being thrown without any comments on why this check is needed. As I explained > above, startOffset going backwards is perfectly valid, to deal with word > splitting and span operations on these specialized use cases. On the other > hand, it is not clear what value is added by this check and which highlighter > code is affected by offsets going backwards. This same check is done at > BaseTokenStreamTestCase:245. > I see others talk about how this check found bugs in WordDelimiter etc. but > it also prevents legitimate use cases. Can this check be removed? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8756) MLT queries ignore custom term frequencies
[ https://issues.apache.org/jira/browse/LUCENE-8756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829556#comment-16829556 ] Michael McCandless commented on LUCENE-8756: Ahh thanks for the ping [~ollik1] I agree we need to fix this; I'll have a look at the PR, thanks! > MLT queries ignore custom term frequencies > -- > > Key: LUCENE-8756 > URL: https://issues.apache.org/jira/browse/LUCENE-8756 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Affects Versions: 7.0, 7.0.1, 7.1, 7.2, 7.2.1, 7.3, 7.4, 7.3.1, 7.5, 7.6, > 7.7, 7.7.1, 8.0 >Reporter: Olli Kuonanoja >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The MLT queries ignore any custom term frequencies for the like-texts and > uses a hard-coded frequency of 1 per occurrence. I have prepared a test-case > to demonstrate the issue and a fix proposal > https://github.com/ollik1/lucene-solr/commit/9dbbce2af26698cec1ac82a526d9cee60a880678 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose
[ https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825332#comment-16825332 ] Michael McCandless commented on LUCENE-8776: I think your use case can be properly handled as a token graph, without offsets going backwards, if you set proper {{PositionLengthAttribute}} for each token; indeed it's for exactly cases like this that we added {{PositionLengthAttribute}}. Give your {{light-emitting-diode}} token {{PositionLengthAttribute=3}} so that the consumer of the tokens knows it spans over the three separate tokens ({{light}}, {{emitting}} and {{diode}}). To get correct behavior you must do this analysis at query time, and Lucene's query parsers will properly interpret the resulting graph and query the index correctly. Unfortunately, you cannot properly index a token graph: Lucene discards the {{PositionLengthAttribute}} which is why if you really want to index a token graph you should insert a {{FlattenGraphFilter}} at the end of your chain. This still discards information (loses the graph-ness) but tries to do so minimizing how queries are broken. > Start offset going backwards has a legitimate purpose > - > > Key: LUCENE-8776 > URL: https://issues.apache.org/jira/browse/LUCENE-8776 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 7.6 >Reporter: Ram Venkat >Priority: Major > > Here is the use case where startOffset can go backwards: > Say there is a line "Organic light-emitting-diode glows", and I want to run > span queries and highlight them properly. > During index time, light-emitting-diode is split into three words, which > allows me to search for 'light', 'emitting' and 'diode' individually. The > three words occupy adjacent positions in the index, as 'light' adjacent to > 'emitting' and 'light' at a distance of two words from 'diode' need to match > this word. 
So, the order of words after splitting are: Organic, light, > emitting, diode, glows. > But, I also want to search for 'organic' being adjacent to > 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. > The way I solved this was to also generate 'light-emitting-diode' at two > positions: (a) In the same position as 'light' and (b) in the same position > as 'glows', like below: > ||organic||light||emitting||diode||glows|| > | |light-emitting-diode| |light-emitting-diode| | > |0|1|2|3|4| > The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets > are obviously the same. This works beautifully in Lucene 5.x in both > searching and highlighting with span queries. > But when I try this in Lucene 7.6, it hits the condition "Offsets must not go > backwards" at DefaultIndexingChain:818. This IllegalArgumentException is > being thrown without any comments on why this check is needed. As I explained > above, startOffset going backwards is perfectly valid, to deal with word > splitting and span operations on these specialized use cases. On the other > hand, it is not clear what value is added by this check and which highlighter > code is affected by offsets going backwards. This same check is done at > BaseTokenStreamTestCase:245. > I see others talk about how this check found bugs in WordDelimiter etc. but > it also prevents legitimate use cases. Can this check be removed? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
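The graph described in the comment can be modeled with plain (term, position, positionLength) tuples — a standalone sketch of the idea, not Lucene's attribute API. A token's arc runs from its position to position + positionLength, so "light-emitting-diode" with positionLength=3 spans the same nodes as light + emitting + diode, and adjacency queries work without any backwards offsets:

```python
# Each token: (term, position, position_length).
tokens = [
    ("organic", 0, 1),
    ("light", 1, 1),
    ("light-emitting-diode", 1, 3),  # side path spanning three positions
    ("emitting", 2, 1),
    ("diode", 3, 1),
    ("glows", 4, 1),
]

def adjacent(t1, t2):
    """True if t2 can directly follow t1 in the token graph."""
    _, pos1, len1 = t1
    _, pos2, _ = t2
    return pos1 + len1 == pos2
```

With this model, "organic" is adjacent to "light-emitting-diode", and "light-emitting-diode" is adjacent to "glows" — the two queries the reporter wants — without indexing the compound token at two positions.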
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809218#comment-16809218 ] Michael McCandless commented on LUCENE-8753: I think this is similar to the terms dictionary format Lucene used to have before {{BlockTree}}, still in Lucene's sources as {{BlockTermsReader/Writer}}. Terms are assigned to fixed sized blocks and only the minimum unique prefix needs to be enrolled in the terms index FST. But being able to do binary search within a block is unique! That's very cool. It's curious you see gains e.g. for {{AndHighLow}} – are you also doing something different to encode/decode postings (not just terms dictionary)? The 500K docs is a little small – can you post results on the full {{wikimediumall}} set of documents? > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. 
> There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
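The block-selection heuristic described above — within a tolerance window around the target block size, cut at the term with the minimal distinguishing prefix, since that prefix is what gets enrolled in the terms-index FST — can be sketched standalone. The constants and names here are illustrative, not the UniformSplit implementation:

```python
def distinguishing_prefix_len(prev_term, term):
    """Length of the shortest prefix of `term` that distinguishes it
    from `prev_term` (terms arrive in sorted order, term > prev_term)."""
    i = 0
    while i < len(prev_term) and i < len(term) and prev_term[i] == term[i]:
        i += 1
    return i + 1

def pick_block_boundary(terms, start, target=32, delta=3):
    """Among candidate cut points target +/- delta terms after `start`,
    pick the one whose first term has the minimal distinguishing prefix
    versus the preceding term."""
    best, best_len = None, None
    for cut in range(start + target - delta, start + target + delta + 1):
        if cut >= len(terms):
            break
        plen = distinguishing_prefix_len(terms[cut - 1], terms[cut])
        if best_len is None or plen < best_len:
            best, best_len = cut, plen
    return best
```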
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809188#comment-16809188 ] Michael McCandless commented on LUCENE-8753: {quote}I think PKLookup should be disregarded until it's fixed: [https://github.com/mikemccand/luceneutil/issues/35] (feel free to comment there if people have opinions) {quote} Note that the title on that issue was misleading (backwards from the reality) – I just corrected it. I don't think we should disregard {{PKLookup}} results: it's reporting the performance when looking up actual IDs that do exist in the index. That is an interesting result, but it is odd you are seeing varying/inconsistent results. Note that if you add {{-jira}} into the luceneutil benchmark command-line it will print results using the markup that Jira displays as a table, making it easier for everyone to read. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8740) AssertionError FlattenGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-8740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804188#comment-16804188 ] Michael McCandless commented on LUCENE-8740: Maybe a dup of https://issues.apache.org/jira/browse/LUCENE-8723? > AssertionError FlattenGraphFilter > - > > Key: LUCENE-8740 > URL: https://issues.apache.org/jira/browse/LUCENE-8740 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 7.5, 8.0 >Reporter: Markus Jelsma >Priority: Major > Fix For: 8.1, master (9.0) > > Attachments: LUCENE-8740.patch > > > Our unit tests picked up an unusual AssertionError in FlattenGraphFilter > which manifests itself only in very specific circumstances involving > WordDelimiterGraph, StopFilter, FlattenGraphFilter and MinhashFilter. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8150) Remove references to segments.gen.
[ https://issues.apache.org/jira/browse/LUCENE-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16796199#comment-16796199 ] Michael McCandless commented on LUCENE-8150: +1 > Remove references to segments.gen. > -- > > Key: LUCENE-8150 > URL: https://issues.apache.org/jira/browse/LUCENE-8150 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.1, master (9.0) > > Attachments: LUCENE-8150.patch, LUCENE-8150.patch > > > This was the way we wrote pending segment files before we switch to > {{pending_segments_N}} in LUCENE-5925. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8150) Remove references to segments.gen.
[ https://issues.apache.org/jira/browse/LUCENE-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794558#comment-16794558 ] Michael McCandless commented on LUCENE-8150: Hi [~jpountz], I fixed one issue with [http://jirasearch.mikemccandless.com|http://jirasearch.mikemccandless.com/], namely that it was incorrectly using strike-through for the issue id for issues that had "Status: PATCH AVAILABLE". For example, this issue is no longer rendered with strikethrough. Note that you can drill down on two different status if you hold the shift key when clicking on them; for example here are all issues that are Open, Reopened or Patch Available: [http://jirasearch.mikemccandless.com/search.py?chg=ddm&text=&a1=status&a2=Patch+Available&sort=recentlyUpdated&format=list&dd=project%3ALucene&dd=status%3AOpen%2CReopened] > Remove references to segments.gen. > -- > > Key: LUCENE-8150 > URL: https://issues.apache.org/jira/browse/LUCENE-8150 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.1, master (9.0) > > Attachments: LUCENE-8150.patch > > > This was the way we wrote pending segment files before we switch to > {{pending_segments_N}} in LUCENE-5925. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8150) Remove references to segments.gen.
[ https://issues.apache.org/jira/browse/LUCENE-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793599#comment-16793599 ] Michael McCandless commented on LUCENE-8150: {quote}I think it's due to the fact that I'm always filtering by open issues on jirasearch, and it filters out issues that are marked as "patch available" {quote} Oh no! Sorry :) I will try to fix this. Clearly [http://jirasearch.mikemccandless.com|http://jirasearch.mikemccandless.com/] is buggy here ... it seems to think issues that have patches are resolved? > Remove references to segments.gen. > -- > > Key: LUCENE-8150 > URL: https://issues.apache.org/jira/browse/LUCENE-8150 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.1, master (9.0) > > Attachments: LUCENE-8150.patch > > > This was the way we wrote pending segment files before we switch to > {{pending_segments_N}} in LUCENE-5925. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8150) Remove references to segments.gen.
[ https://issues.apache.org/jira/browse/LUCENE-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793595#comment-16793595 ] Michael McCandless commented on LUCENE-8150: Hmm it looks like this was never committed? > Remove references to segments.gen. > -- > > Key: LUCENE-8150 > URL: https://issues.apache.org/jira/browse/LUCENE-8150 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.1, master (9.0) > > Attachments: LUCENE-8150.patch > > > This was the way we wrote pending segment files before we switch to > {{pending_segments_N}} in LUCENE-5925. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8542) Provide the LeafSlice to CollectorManager.newCollector to save memory on small index slices
[ https://issues.apache.org/jira/browse/LUCENE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791706#comment-16791706 ] Michael McCandless commented on LUCENE-8542: +1 to improve slices() to aggregate small slices together by default; that's what we are doing in our production service – we combine up to 5 segments, up to 250K docs in aggregate. > Provide the LeafSlice to CollectorManager.newCollector to save memory on > small index slices > --- > > Key: LUCENE-8542 > URL: https://issues.apache.org/jira/browse/LUCENE-8542 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Reporter: Christoph Kaser >Priority: Minor > Attachments: LUCENE-8542.patch > > > I have an index consisting of 44 million documents spread across 60 segments. > When I run a query against this index with a huge number of results requested > (e.g. 5 million), this query uses more than 5 GB of heap if the IndexSearch > was configured to use an ExecutorService. > (I know this kind of query is fairly unusual and it would be better to use > paging and searchAfter, but our architecture does not allow this at the > moment.) > The reason for the huge memory requirement is that the search [will create a > TopScoreDocCollector for each > segment|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L404], > each one with numHits = 5 million. This is fine for the large segments, but > many of those segments are fairly small and only contain several thousand > documents. This wastes a huge amount of memory for queries with large values > of numHits on indices with many segments. 
> Therefore, I propose to change the CollectorManager - interface in the > following way: > * change the method newCollector to accept a parameter LeafSlice that can be > used to determine the total count of documents in the LeafSlice > * Maybe, in order to remain backwards compatible, it would be possible to > introduce this as a new method with a default implementation that calls the > old method - otherwise, it probably has to wait for Lucene 8? > * This can then be used to cap numHits for each TopScoreDocCollector to the > leafslice-size. > If this is something that would make sense for you, I can try to provide a > patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
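The slice-aggregation policy the comment above mentions (combine up to 5 segments, up to 250K docs per slice) could look roughly like the sketch below. To stay self-contained it models segments as plain doc counts instead of Lucene's {{LeafReaderContext}}/{{LeafSlice}}; {{SlicePacker}} and its constants are illustrative assumptions, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of a greedy slice-packing policy: close the current slice
// once it holds 5 segments or adding the next segment would push it past
// 250K documents in aggregate.
public class SlicePacker {
  static final int MAX_SEGMENTS_PER_SLICE = 5;
  static final int MAX_DOCS_PER_SLICE = 250_000;

  public static List<List<Integer>> pack(List<Integer> segmentDocCounts) {
    List<List<Integer>> slices = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    int docs = 0;
    for (int count : segmentDocCounts) {
      boolean full = current.size() >= MAX_SEGMENTS_PER_SLICE
          || (!current.isEmpty() && docs + count > MAX_DOCS_PER_SLICE);
      if (full) {           // start a new slice
        slices.add(current);
        current = new ArrayList<>();
        docs = 0;
      }
      current.add(count);
      docs += count;
    }
    if (!current.isEmpty()) slices.add(current);
    return slices;
  }
}
```

In real code this logic would live in an override of {{IndexSearcher}}'s protected slicing hook, so that each slice becomes one unit of work for the executor.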
[jira] [Resolved] (LUCENE-8720) Integer overflow bug in NameIntCacheLRU.makeRoomLRU()
[ https://issues.apache.org/jira/browse/LUCENE-8720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-8720. Resolution: Fixed Fix Version/s: (was: 7.1.1) 8.1 master (9.0) > Integer overflow bug in NameIntCacheLRU.makeRoomLRU() > - > > Key: LUCENE-8720 > URL: https://issues.apache.org/jira/browse/LUCENE-8720 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 7.7.1 > Environment: Mac OS X 10.11.6 but this bug is not affected by the > environment because it is a straightforward integer overflow bug. >Reporter: Russell A Brown >Priority: Major > Labels: easyfix, patch > Fix For: master (9.0), 8.1 > > Attachments: LUCENE-.patch > > > The NameIntCacheLRU.makeRoomLRU() method has an integer overflow bug because > if maxCacheSize >= Integer.MAX_VALUE/2, 2*maxCacheSize will overflow to > -(2^30) and the value of n will overflow to a negative integer as well, which > will prevent any clearing of the cache whatsoever. Hence, performance will > degrade once the cache becomes full because it will be impossible to remove > any entries in order to add new entries to the cache. > Moreover, comments in NameIntCacheLRU.java and LruTaxonomyWriterCache.java > indicate that 2/3 of the cache will be cleared, whereas in fact only 1/3 of > the cache is cleared. So as not to change the behavior of the > NameIntCacheLRU.makeRoomLRU() method, I have not changed the code to clear > 2/3 of the cache but instead I have changed the comments to indicate that 1/3 > of the cache is cleared. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
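The overflow the issue describes is easy to demonstrate: computing {{2*maxCacheSize}} in int arithmetic wraps negative once {{maxCacheSize >= Integer.MAX_VALUE/2}}, so the computed "room to make" goes negative and nothing is ever evicted. Below is an illustrative reconstruction of the bug and one hedged fix (widening to long before multiplying); {{OverflowDemo}} is not the actual NameIntCacheLRU code.

```java
// Hedged sketch of the integer-overflow bug and a fix.
public class OverflowDemo {

  // buggy variant: 2 * maxCacheSize wraps to a negative int when
  // maxCacheSize >= Integer.MAX_VALUE / 2
  static int targetSizeBuggy(int maxCacheSize) {
    return 2 * maxCacheSize / 3;
  }

  // fixed variant: do the multiplication in long, then narrow the
  // (now in-range) result back to int
  static int targetSizeFixed(int maxCacheSize) {
    return (int) (2L * maxCacheSize / 3);
  }
}
```

Dividing before multiplying ({{maxCacheSize / 3 * 2}}) also avoids the wrap, at the cost of a slightly different rounding.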
[jira] [Commented] (LUCENE-8717) Handle stop words that appear at articulation points
[ https://issues.apache.org/jira/browse/LUCENE-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790698#comment-16790698 ] Michael McCandless commented on LUCENE-8717: +1 for {{TermDeletedAttribute}}. Are we also fixing {{StopFilter}} to set {{TermDeletedAttribute}}? Would this mean that a {{SynonymFilter}} trying to match a synonym containing a stop word would now match even when {{StopFilter}} before it marked the token deleted? > Handle stop words that appear at articulation points > > > Key: LUCENE-8717 > URL: https://issues.apache.org/jira/browse/LUCENE-8717 > Project: Lucene - Core > Issue Type: Bug >Reporter: Alan Woodward >Assignee: Alan Woodward >Priority: Major > Attachments: LUCENE-8717.patch, LUCENE-8717.patch > > > Our set of TokenFilters currently cannot handle the case where a multi-term > synonym starts with a stopword. This means that given a synonym file > containing the mapping "the walking dead => twd" and a standard english > stopword filter, QueryBuilder will produce incorrect queries. > The tricky part here is that our standard way of dealing with stopwords, > which is to just remove them entirely from the token stream and use a larger > position increment on subsequent tokens, doesn't work when the removed token > also has a position length greater than 1. There are various tricks you can > do to increment position length on the previous token, but this doesn't work > if the stopword is the first token in the token stream, or if there are > multiple stopwords in the side path. > Instead, I'd like to propose adding a new TermDeletedAttribute, which we only > use on tokens that should be removed from the stream but which hold necessary > information about the structure of the token graph. These tokens can then be > removed by GraphTokenStreamFiniteStrings at query time, and by > FlattenGraphFilter at index time. 
[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() may not reflect all corrupting exceptions (notably: NoSuchFileException)
[ https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790690#comment-16790690 ] Michael McCandless commented on LUCENE-8692: {{rollback}} gives you a way to close {{IndexWriter}} without doing a commit, which seems useful. If you removed that, what would users do instead? > IndexWriter.getTragicException() may not reflect all corrupting exceptions > (notably: NoSuchFileException) > - > > Key: LUCENE-8692 > URL: https://issues.apache.org/jira/browse/LUCENE-8692 > Project: Lucene - Core > Issue Type: Bug >Reporter: Hoss Man >Priority: Major > Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, > LUCENE-8692_test.patch > > > Backstory... > Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's > {{corruptFiles}} to introduce corruption into the "leader" node's index and > then assert that this Solr node gives up its leadership of the shard and > another replica takes over. > This can currently fail sporadically (but usually reproducibly - see > SOLR-13237) due to the leader not giving up its leadership even after the > corruption causes an update/commit to fail. Solr's leadership code makes this > decision after encountering an exception from the IndexWriter based on whether > {{IndexWriter.getTragicException()}} is (non-)null. > > While investigating this, I created an isolated Lucene-Core equivalent test > that demonstrates the same basic situation: > * Gradually cause corruption on an index until (otherwise) valid execution > of IW.add() + IW.commit() calls throw an exception to the IW client. > * assert that if an exception is thrown to the IW client, > {{getTragicException()}} is now non-null. > It's fairly easy to make my new test fail reproducibly – in every situation > I've seen the underlying exception is a {{NoSuchFileException}} (ie: the > randomly introduced corruption was to delete some file). 
[jira] [Commented] (LUCENE-8720) Integer overflow bug in NameIntCacheLRU.makeRoomLRU()
[ https://issues.apache.org/jira/browse/LUCENE-8720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790681#comment-16790681 ] Michael McCandless commented on LUCENE-8720: Thanks [~kirigirisu], nice catch – I'll pass tests and push soon. > Integer overflow bug in NameIntCacheLRU.makeRoomLRU() > - > > Key: LUCENE-8720 > URL: https://issues.apache.org/jira/browse/LUCENE-8720 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 7.7.1 > Environment: Mac OS X 10.11.6 but this bug is not affected by the > environment because it is a straightforward integer overflow bug. >Reporter: Russell A Brown >Priority: Major > Labels: easyfix, patch > Fix For: 7.1.1 > > Attachments: LUCENE-.patch > > > The NameIntCacheLRU.makeRoomLRU() method has an integer overflow bug because > if maxCacheSize >= Integer.MAX_VALUE/2, 2*maxCacheSize will overflow to > -(2^30) and the value of n will overflow to a negative integer as well, which > will prevent any clearing of the cache whatsoever. Hence, performance will > degrade once the cache becomes full because it will be impossible to remove > any entries in order to add new entries to the cache. > Moreover, comments in NameIntCacheLRU.java and LruTaxonomyWriterCache.java > indicate that 2/3 of the cache will be cleared, whereas in fact only 1/3 of > the cache is cleared. So as not to change the behavior of the > NameIntCacheLRU.makeRoomLRU() method, I have not changed the code to clear > 2/3 of the cache but instead I have changed the comments to indicate that 1/3 > of the cache is cleared. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8542) Provide the LeafSlice to CollectorManager.newCollector to save memory on small index slices
[ https://issues.apache.org/jira/browse/LUCENE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790677#comment-16790677 ] Michael McCandless commented on LUCENE-8542: Maybe we should try swapping in the JDK's {{PriorityQueue}} and measure if this really hurts search throughput? > Provide the LeafSlice to CollectorManager.newCollector to save memory on > small index slices > --- > > Key: LUCENE-8542 > URL: https://issues.apache.org/jira/browse/LUCENE-8542 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Reporter: Christoph Kaser >Priority: Minor > Attachments: LUCENE-8542.patch > > > I have an index consisting of 44 million documents spread across 60 segments. > When I run a query against this index with a huge number of results requested > (e.g. 5 million), this query uses more than 5 GB of heap if the IndexSearch > was configured to use an ExecutorService. > (I know this kind of query is fairly unusual and it would be better to use > paging and searchAfter, but our architecture does not allow this at the > moment.) > The reason for the huge memory requirement is that the search [will create a > TopScoreDocCollector for each > segment|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L404], > each one with numHits = 5 million. This is fine for the large segments, but > many of those segments are fairly small and only contain several thousand > documents. This wastes a huge amount of memory for queries with large values > of numHits on indices with many segments. 
> Therefore, I propose to change the CollectorManager - interface in the > following way: > * change the method newCollector to accept a parameter LeafSlice that can be > used to determine the total count of documents in the LeafSlice > * Maybe, in order to remain backwards compatible, it would be possible to > introduce this as a new method with a default implementation that calls the > old method - otherwise, it probably has to wait for Lucene 8? > * This can then be used to cap numHits for each TopScoreDocCollector to the > leafslice-size. > If this is something that would make sense for you, I can try to provide a > patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8542) Provide the LeafSlice to CollectorManager.newCollector to save memory on small index slices
[ https://issues.apache.org/jira/browse/LUCENE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790576#comment-16790576 ] Michael McCandless commented on LUCENE-8542: I think the core API change is quite minor and reasonable – letting the {{Collector.newCollector}} know which segments (slice) it will collect? E.g. we already pass the {{LeafReaderContext}} to {{Collector.newLeafCollector}} so it's informed about the details of which segment it's about to collect. I agree the motivating use case here is somewhat abusive, and a custom Collector is probably needed anyway, but I think this API change could help non-abusive cases too. Alternatively we could explore fixing our default top hits collectors to not pre-allocate the full topN for every slice ... that is really unexpected behavior, and users have tripped up on this multiple times in the past causing us to make some partial fixes for it. > Provide the LeafSlice to CollectorManager.newCollector to save memory on > small index slices > --- > > Key: LUCENE-8542 > URL: https://issues.apache.org/jira/browse/LUCENE-8542 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Reporter: Christoph Kaser >Priority: Minor > Attachments: LUCENE-8542.patch > > > I have an index consisting of 44 million documents spread across 60 segments. > When I run a query against this index with a huge number of results requested > (e.g. 5 million), this query uses more than 5 GB of heap if the IndexSearch > was configured to use an ExecutorService. > (I know this kind of query is fairly unusual and it would be better to use > paging and searchAfter, but our architecture does not allow this at the > moment.) 
> The reason for the huge memory requirement is that the search [will create a > TopScoreDocCollector for each > segment|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L404], > each one with numHits = 5 million. This is fine for the large segments, but > many of those segments are fairly small and only contain several thousand > documents. This wastes a huge amount of memory for queries with large values > of numHits on indices with many segments. > Therefore, I propose to change the CollectorManager - interface in the > following way: > * change the method newCollector to accept a parameter LeafSlice that can be > used to determine the total count of documents in the LeafSlice > * Maybe, in order to remain backwards compatible, it would be possible to > introduce this as a new method with a default implementation that calls the > old method - otherwise, it probably has to wait for Lucene 8? > * This can then be used to cap numHits for each TopScoreDocCollector to the > leafslice-size. > If this is something that would make sense for you, I can try to provide a > patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
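The proposed cap — size each per-slice collector at min(numHits, slice doc count) instead of always pre-allocating numHits — can be sketched as below. All types are deliberately simplified stand-ins for Lucene's {{CollectorManager}}/{{TopScoreDocCollector}} (plain score queues instead of doc/score collectors), just to show where the cap and the final merge would go.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hedged sketch: the manager learns the slice's doc count when creating
// its collector and bounds the pre-allocated queue accordingly; reduce()
// merges the per-slice results back into a single global top-N.
public class CappedTopScores {
  final int numHits;

  CappedTopScores(int numHits) { this.numHits = numHits; }

  // analogous to CollectorManager.newCollector(LeafSlice)
  PriorityQueue<Float> newCollector(int sliceDocCount) {
    // a slice can never contribute more hits than it has documents
    int cap = Math.min(numHits, Math.max(1, sliceDocCount));
    return new PriorityQueue<>(cap); // pre-allocation bounded by slice size
  }

  // analogous to CollectorManager.reduce: merge per-slice queues,
  // best score first, truncated to the global numHits
  List<Float> reduce(List<PriorityQueue<Float>> perSlice) {
    List<Float> all = new ArrayList<>();
    for (PriorityQueue<Float> q : perSlice) all.addAll(q);
    all.sort(Collections.reverseOrder());
    return all.subList(0, Math.min(numHits, all.size()));
  }
}
```

For the motivating case (numHits = 5 million against many small segments), the per-slice allocation shrinks from numHits to the slice's few thousand documents.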
[jira] [Commented] (LUCENE-8216) Better cross-field scoring
[ https://issues.apache.org/jira/browse/LUCENE-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775884#comment-16775884 ] Michael McCandless commented on LUCENE-8216: Can this be resolved now? Looks like [~jim.ferenczi] pushed the new query to sandbox? > Better cross-field scoring > -- > > Key: LUCENE-8216 > URL: https://issues.apache.org/jira/browse/LUCENE-8216 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Major > Fix For: 8.0 > > Attachments: LUCENE-8216.patch, LUCENE-8216.patch > > > I'd like Lucene to have better support for scoring across multiple fields. > Today we have BlendedTermQuery which tries to help there but it probably > tries to do too much on some aspects (handling cross-field term queries AND > synonyms) and too little on other ones (it tries to merge index-level > statistics, but not per-document statistics like tf and norm). > Maybe we could implement something like BM25F so that queries across multiple > fields would retain the benefits of BM25 like the fact that the impact of the > term frequency saturates quickly, which is not the case with BlendedTermQuery > if you have occurrences across many fields. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8703) Build point writers only when needed on the BKD tree
[ https://issues.apache.org/jira/browse/LUCENE-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775181#comment-16775181 ] Michael McCandless commented on LUCENE-8703: It'd be nice to have a metric in our nightly points benchmarks measuring how much heap was required while building the index. > Build point writers only when needed on the BKD tree > > > Key: LUCENE-8703 > URL: https://issues.apache.org/jira/browse/LUCENE-8703 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Ignacio Vera >Priority: Major > Attachments: LUCENE-8703.patch, LUCENE-8703.patch, LUCENE-8703.patch > > > With the introduction of LUCENE-8699, I have realised the BKD tree uses quite > a lot of heap even when it is not needed, for example for 1D points. > In this issue I propose to create point writers only when needed. In addition > I propose to create PointWriters based on the estimated point count given in > the constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8671) Add setting for moving FST offheap/onheap
[ https://issues.apache.org/jira/browse/LUCENE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775190#comment-16775190 ] Michael McCandless commented on LUCENE-8671: [~akjain] that's true, maybe we don't need per-field control and a single boolean option would work? We could maybe add a setter on {{BlockTreeTermsWriter}}? And it'd write that setting into the index, and {{BlockTreeTermsReader}} would read that and then load FSTs on or off heap. > Add setting for moving FST offheap/onheap > - > > Key: LUCENE-8671 > URL: https://issues.apache.org/jira/browse/LUCENE-8671 > Project: Lucene - Core > Issue Type: New Feature > Components: core/FSTs, core/store >Reporter: Ankit Jain >Priority: Minor > Attachments: offheap_generic_settings.patch, offheap_settings.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > While LUCENE-8635, adds support for loading FST offheap using mmap, users do > not have the flexibility to specify fields for which FST needs to be > offheap. This allows users to tune heap usage as per their workload. > Ideal way will be to add an attribute to FieldInfo, where we have > put/getAttribute. Then FieldReader can inspect the FieldInfo and pass the > appropriate On/OffHeapStore when creating its FST. It can support special > keywords like ALL/NONE. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap
[ https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-8635. Resolution: Fixed Fix Version/s: master (9.0) 8.x 8.0 Thanks [~akjain]! > Lazy loading Lucene FST offheap using mmap > -- > > Key: LUCENE-8635 > URL: https://issues.apache.org/jira/browse/LUCENE-8635 > Project: Lucene - Core > Issue Type: New Feature > Components: core/FSTs > Environment: I used below setup for es_rally tests: > single node i3.xlarge running ES 6.5 > es_rally was running on another i3.xlarge instance >Reporter: Ankit Jain >Priority: Major > Fix For: 8.0, 8.x, master (9.0) > > Attachments: fst-offheap-ra-rev.patch, fst-offheap-rev.patch, > offheap.patch, optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx > > > Currently, FST loads all the terms into heap memory during index open. This > causes frequent JVM OOM issues if the term size gets big. A better way of > doing this will be to lazily load FST using mmap. That ensures only the > required terms get loaded into memory. > > Lucene can expose API for providing list of fields to load terms offheap. I'm > planning to take following approach for this: > # Add a boolean property fstOffHeap in FieldInfo > # Pass list of offheap fields to lucene during index open (ALL can be > special keyword for loading ALL fields offheap) > # Initialize the fstOffHeap property during lucene index open > # FieldReader invokes default FST constructor or OffHeap constructor based > on fstOffHeap field > > I created a patch (that loads all fields offheap), did some benchmarks using > es_rally and results look good. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap
[ https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772153#comment-16772153 ] Michael McCandless commented on LUCENE-8635: I ran luceneutil on {{wikimediumall}} with current trunk vs PR here – net/net looks like noise, which is great – I'll push shortly:
{noformat}
Report after iter 19:
Task                        QPS base  StdDev   QPS comp  StdDev   Pct diff
Prefix3                        37.05 (11.4%)     36.25 (13.0%)   -2.1% ( -23% -  25%)
BrowseMonthSSDVFacets           5.01  (6.4%)      4.91 (10.4%)   -1.9% ( -17% -  15%)
BrowseMonthTaxoFacets           1.24  (2.7%)      1.22  (4.8%)   -1.3% (  -8% -   6%)
Wildcard                      106.53  (8.6%)    105.18  (9.1%)   -1.3% ( -17% -  18%)
HighTermDayOfYearSort          14.85  (4.2%)     14.70  (4.2%)   -1.0% (  -9% -   7%)
BrowseDateTaxoFacets            1.11  (3.2%)      1.10  (5.6%)   -0.8% (  -9% -   8%)
BrowseDayOfYearTaxoFacets       1.11  (3.1%)      1.10  (5.6%)   -0.8% (  -9% -   8%)
MedSloppyPhrase                 4.59  (3.4%)      4.56  (2.8%)   -0.5% (  -6% -   5%)
Fuzzy2                         68.49  (1.0%)     68.12  (1.3%)   -0.5% (  -2% -   1%)
LowSpanNear                    30.34  (1.7%)     30.19  (1.9%)   -0.5% (  -4% -   3%)
Fuzzy1                         72.43  (0.9%)     72.10  (1.4%)   -0.5% (  -2% -   1%)
LowPhrase                      34.35  (1.1%)     34.22  (2.0%)   -0.4% (  -3% -   2%)
Respell                        47.66  (1.4%)     47.48  (1.7%)   -0.4% (  -3% -   2%)
LowSloppyPhrase                10.59  (4.9%)     10.56  (3.6%)   -0.3% (  -8% -   8%)
HighTerm                     1290.39  (1.8%)   1286.15  (1.4%)   -0.3% (  -3% -   2%)
MedTerm                      1419.25  (2.0%)   1415.23  (1.5%)   -0.3% (  -3% -   3%)
IntNRQ                         27.03 (11.0%)     26.96 (10.9%)   -0.3% ( -19% -  24%)
HighSloppyPhrase                6.73  (4.9%)      6.71  (3.4%)   -0.3% (  -8% -   8%)
OrNotHighHigh                 825.79  (1.9%)    823.77  (1.4%)   -0.2% (  -3% -   3%)
OrNotHighMed                  912.80  (1.3%)    910.96  (1.3%)   -0.2% (  -2% -   2%)
MedPhrase                      29.52  (1.1%)     29.46  (1.9%)   -0.2% (  -3% -   2%)
OrHighNotLow                 1184.54  (3.1%)   1182.86  (1.8%)   -0.1% (  -4% -   4%)
LowTerm                       974.30  (1.5%)    973.33  (1.4%)   -0.1% (  -2% -   2%)
OrHighLow                     328.39  (1.0%)    328.13  (1.0%)   -0.1% (  -2% -   1%)
AndHighHigh                    21.04  (2.8%)     21.03  (2.6%)   -0.1% (  -5% -   5%)
OrHighNotHigh                 907.78  (1.8%)    907.93  (1.4%)    0.0% (  -3% -   3%)
OrHighNotMed                 1019.49  (2.0%)   1019.67  (1.4%)    0.0% (  -3% -   3%)
AndHighMed                     64.27  (1.1%)     64.33  (1.1%)    0.1% (  -2% -   2%)
OrNotHighLow                  414.78  (1.2%)    415.43  (1.0%)    0.2% (  -2% -   2%)
BrowseDayOfYearSSDVFacets       4.14  (6.9%)      4.15  (8.9%)    0.2% ( -14% -  17%)
AndHighLow                    371.09  (1.7%)    371.84  (1.7%)    0.2% (  -3% -   3%)
OrHighMed                      65.31  (1.8%)     65.45  (1.8%)    0.2% (  -3% -   3%)
PKLookup                      141.21  (1.6%)    141.63  (1.9%)    0.3% (  -3% -   3%)
HighSpanNear                   25.84  (2.8%)     25.94  (2.6%)    0.4% (  -4% -   5%)
MedSpanNear                    26.39  (2.9%)     26.50  (2.8%)    0.4% (  -5% -   6%)
HighPhrase                     11.72  (2.1%)     11.77  (1.9%)    0.4% (  -3% -   4%)
OrHighHigh                     14.60  (2.2%)     14.69  (1.8%)    0.6% (  -3% -   4%)
HighTermMonthSort              31.51  (6.0%)     31.90  (6.0%)    1.2% ( -10% -  14%)
{noformat}
> Lazy loading Lucene FST offheap using mmap > -- > > Key: LUCENE-8635 > URL: https://issues.apache.org/jira/browse/LUCENE-8635 > Project: Lucene - Core > Issue Type: New Feature > Components: core/FSTs > Environment: I used below setup for es_rally tests: > single node i3.xlarge running ES 6.5 > es_rally was running on another i3.xlarge instance >Reporter: Ankit Jain >Priority: Major > Attachments: fst-offheap-ra-rev.patch, fst-offheap-rev.patch, > offheap.patch, optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx > > > Currently, FST loads all the terms into heap memory
[jira] [Commented] (LUCENE-8671) Add setting for moving FST offheap/onheap
[ https://issues.apache.org/jira/browse/LUCENE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772068#comment-16772068 ] Michael McCandless commented on LUCENE-8671: Actually I think this is a good use case for the existing attributes in {{FieldInfo}} – this sort of extensibility is exactly why we have attributes. But can you use the existing {{attributes}} instead of adding a new {{readerAttributes}}? And could we make this something a custom {{Codec}} impl would set? Then we shouldn't need any changes to {{FieldInfo.java}}, {{IndexWriter.java}}, {{LiveIndexWriterConfig.java}}, etc. We'd just make a custom codec setting this attribute for fields where we want to override Lucene's ({{BlockTreeTermReader}}'s) default behavior. Yes, it'd mean one must commit at indexing time as to which fields will be on vs off heap at search time, but I think that's an OK tradeoff? > Add setting for moving FST offheap/onheap > - > > Key: LUCENE-8671 > URL: https://issues.apache.org/jira/browse/LUCENE-8671 > Project: Lucene - Core > Issue Type: New Feature > Components: core/FSTs, core/store >Reporter: Ankit Jain >Priority: Minor > Attachments: offheap_generic_settings.patch, offheap_settings.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > While LUCENE-8635, adds support for loading FST offheap using mmap, users do > not have the flexibility to specify fields for which FST needs to be > offheap. This allows users to tune heap usage as per their workload. > Ideal way will be to add an attribute to FieldInfo, where we have > put/getAttribute. Then FieldReader can inspect the FieldInfo and pass the > appropriate On/OffHeapStore when creating its FST. It can support special > keywords like ALL/NONE. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
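The attribute-based approach suggested in the comment above can be sketched as follows: a custom codec would write a per-field attribute at index time, and the terms reader would consult it when deciding whether to load that field's FST on or off heap. {{FieldInfo}} here is a minimal stand-in for Lucene's class, and the attribute key is a hypothetical name, not an existing Lucene constant.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of FieldInfo-attribute-driven FST loading.
public class FstLoadPolicy {
  static final String FST_OFF_HEAP_ATTR = "fst.offheap"; // hypothetical key

  // minimal stand-in for org.apache.lucene.index.FieldInfo's attribute map
  static class FieldInfo {
    final Map<String, String> attributes = new HashMap<>();
    String getAttribute(String key) { return attributes.get(key); }
    void putAttribute(String key, String value) { attributes.put(key, value); }
  }

  // the decision the terms reader would make when opening the field's
  // terms index; absent attribute defaults to on-heap
  static boolean loadOffHeap(FieldInfo fi) {
    return Boolean.parseBoolean(fi.getAttribute(FST_OFF_HEAP_ATTR));
  }
}
```

As the comment notes, the tradeoff is that the choice is baked in at indexing time: changing the on/off-heap decision for a field means rewriting the attribute.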
[jira] [Commented] (LUCENE-8675) Divide Segment Search Amongst Multiple Threads
[ https://issues.apache.org/jira/browse/LUCENE-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759440#comment-16759440 ] Michael McCandless commented on LUCENE-8675: {quote}If some segments are getting large enough that intra-segment parallelism becomes appealing, then maybe an easier and more efficient way to increase parallelism is to instead reduce the maximum segment size so that inter-segment parallelism has more potential for parallelizing query execution. {quote} Yeah that is a good workaround given how Lucene works today. It's essentially the same as your original suggestion ("make more shards and search them concurrently"), just at the segment instead of shard level. But this still adds some costs -- the per-segment fixed cost for each query. That cost should be less than the per shard fixed cost in the sharded case, but it's still adding some cost. If instead Lucene had a way to divide large segments into multiple work units (and I agree there are challenges with that! -- not just BKD and multi-term queries, but e.g. how would early termination work?) then we could pay that per-segment fixed cost once for such segments then let multiple threads share the variable cost work of finding and ranking hits. In our recently launched production index we see sizable jumps in the P99+ query latencies when a large segment merges finish and replicate, because we are using "thread per segment" concurrency that we are hoping we could improve by pushing thread concurrency into individual large segments. 
> Divide Segment Search Amongst Multiple Threads > -- > > Key: LUCENE-8675 > URL: https://issues.apache.org/jira/browse/LUCENE-8675 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Reporter: Atri Sharma >Priority: Major > > Segment search is a single threaded operation today, which can be a > bottleneck for large analytical queries which index a lot of data and have > complex queries which touch multiple segments (imagine a composite query with > range query and filters on top). This ticket is for discussing the idea of > splitting a single segment into multiple threads based on mutually exclusive > document ID ranges. > This will be a two phase effort, the first phase targeting queries returning > all matching documents (collectors not terminating early). The second phase > patch will introduce staged execution and will build on top of this patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
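The "mutually exclusive document ID ranges" idea from the issue description amounts to partitioning a segment's doc ID space `[0, maxDoc)` into contiguous slices, one per thread. A minimal sketch of just the partitioning arithmetic (illustrative, not Lucene API):

```java
// Illustrative sketch of splitting one segment's doc ID space [0, maxDoc)
// into contiguous, mutually exclusive ranges, one per worker thread.
public class DocIdSlices {
  public static int[][] slice(int maxDoc, int numSlices) {
    int[][] slices = new int[numSlices][2];
    int base = maxDoc / numSlices;
    int rem = maxDoc % numSlices; // first `rem` slices get one extra doc
    int start = 0;
    for (int i = 0; i < numSlices; i++) {
      int len = base + (i < rem ? 1 : 0);
      slices[i][0] = start;        // inclusive lower bound
      slices[i][1] = start + len;  // exclusive upper bound
      start += len;
    }
    return slices;
  }

  public static void main(String[] args) {
    for (int[] s : slice(10, 3)) {
      System.out.println(s[0] + ".." + s[1]);
    }
    // 0..4, 4..7, 7..10
  }
}
```

The challenge the thread raises is that some scorers (e.g. BKD-backed range queries) do work proportional to the whole segment regardless of which doc ID range a slice is responsible for.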
[jira] [Commented] (SOLR-13190) Fuzzy search treated as server error instead of client error when terms are too complex
[ https://issues.apache.org/jira/browse/SOLR-13190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759431#comment-16759431 ] Michael McCandless commented on SOLR-13190: --- +1 to improve the exception message to include the field and fuzzy term that led to this. However, this exception is baffling because the way our FuzzyQuery works is to directly produce an already determinized and minimized automaton – that's the beauty of the (efficient) Levenshtein automaton construction algorithm. So why are we then trying to determinize it again? Something bad is lurking here – somehow we lost track that the automaton is already determinized? > Fuzzy search treated as server error instead of client error when terms are > too complex > --- > > Key: SOLR-13190 > URL: https://issues.apache.org/jira/browse/SOLR-13190 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: search >Affects Versions: master (9.0) >Reporter: Mike Drob >Assignee: Mike Drob >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > We've seen a fuzzy search end up breaking the automaton and getting reported > as a server error. This usage should be improved by > 1) reporting as a client error, because it's similar to something like too > many boolean clauses queries in how an operator should deal with it > 2) report what field is causing the error, since that currently must be > deduced from adjacent query logs and can be difficult if there are multiple > terms in the search > This trigger was added to defend against adversarial regex but somehow hits > fuzzy terms as well, I don't understand enough about the automaton mechanisms > to really know how to approach a fix there, but improving the operability is > a good first step. 
> relevant stack trace: > {noformat} > org.apache.lucene.util.automaton.TooComplexToDeterminizeException: > Determinizing automaton with 13632 states and 21348 transitions would result > in more than 1 states. > at > org.apache.lucene.util.automaton.Operations.determinize(Operations.java:746) > at > org.apache.lucene.util.automaton.RunAutomaton.<init>(RunAutomaton.java:69) > at > org.apache.lucene.util.automaton.ByteRunAutomaton.<init>(ByteRunAutomaton.java:32) > at > org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247) > at > org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:133) > at > org.apache.lucene.search.FuzzyTermsEnum.<init>(FuzzyTermsEnum.java:143) > at org.apache.lucene.search.FuzzyQuery.getTermsEnum(FuzzyQuery.java:154) > at > org.apache.lucene.search.MultiTermQuery$RewriteMethod.getTermsEnum(MultiTermQuery.java:78) > at > org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:58) > at > org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67) > at > org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310) > at > org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:667) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:442) > at > org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:200) > at > org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1604) > at > org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1420) > at > org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:567) > at > org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1435) > at > org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:374) > at > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:298) > at > 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2559) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
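The ticket's first suggestion, reporting this as a client rather than a server error, boils down to catching the too-complex exception near the request boundary and translating it to a 4xx that names the offending field. A hedged sketch only; the nested exception class stands in for Lucene's `TooComplexToDeterminizeException`, and the status-code mapping is illustrative, not the actual Solr change:

```java
// Illustrative only: translate an "automaton too complex" failure into a
// client (400) rather than server (500) error, including the field name.
public class ErrorMapping {
  // Stand-in for org.apache.lucene.util.automaton.TooComplexToDeterminizeException
  public static class TooComplexException extends RuntimeException {
    public TooComplexException(String msg) { super(msg); }
  }

  public static int statusFor(Throwable t) {
    // Too-complex queries are the client's doing, like "too many boolean
    // clauses", so report 400 instead of 500.
    return (t instanceof TooComplexException) ? 400 : 500;
  }

  public static String messageFor(String field, String term, Throwable t) {
    return "Term too complex for field '" + field + "': " + term
        + " (" + t.getMessage() + ")";
  }

  public static void main(String[] args) {
    Throwable t = new TooComplexException("would exceed state limit");
    System.out.println(statusFor(t) + " " + messageFor("title", "foo~2", t));
  }
}
```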
[jira] [Commented] (LUCENE-8675) Divide Segment Search Amongst Multiple Threads
[ https://issues.apache.org/jira/browse/LUCENE-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758451#comment-16758451 ] Michael McCandless commented on LUCENE-8675: I think it'd be interesting to explore intra-segment parallelism, but I agree w/ [~jpountz] that there are challenges :) If you pass an {{ExecutorService}} to {{IndexSearcher}} today you can already use multiple threads to answer one query, but the concurrency is tied to your segment geometry and annoyingly a supposedly "optimized" index gets no concurrency ;) But if you do have many segments, this can give a nice reduction to query latencies when QPS is well below the searcher's red-line capacity (probably at the expense of some hopefully small loss of red-line throughput because of the added overhead of thread scheduling). For certain use cases (large index, low typical query rate) this is a powerful approach. It's true that one can also divide an index into more shards and run each shard concurrently but then you are also multiplying the fixed query setup cost which in some cases can be relatively significant. {quote}Parallelizing based on ranges of doc IDs is problematic for some queries, for instance the cost of evaluating a range query over an entire segment or only about a specific range of doc IDs is exactly the same given that it uses data-structures that are organized by value rather than by doc ID. {quote} Yeah that's a real problem – these queries traverse the BKD tree per-segment while creating the scorer, which is/can be the costly part, and then produce a bit set which is very fast to iterate over. This phase is not separately visible to the caller, unlike e.g. rewrite that MultiTermQueries use to translate into simpler queries, so it'd be tricky to build intra-segment concurrency on top ... 
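The inter-segment concurrency Mike describes (passing an `ExecutorService` to `IndexSearcher`) follows a scatter-gather shape: score each work unit on its own thread, then merge the per-slice top-k into a global top-k. A plain-Java sketch of that shape, not Lucene's actual collector code; `Hit` and the slice tasks are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of scatter-gather search concurrency: each slice is scored on
// its own thread, then per-slice results are merged into a global top-k.
public class SliceSearch {
  record Hit(int doc, float score) {}

  static List<Hit> topK(List<List<Hit>> perSlice, int k) {
    PriorityQueue<Hit> pq =
        new PriorityQueue<>(Comparator.comparingDouble(Hit::score)); // min-heap
    for (List<Hit> slice : perSlice) {
      for (Hit h : slice) {
        pq.offer(h);
        if (pq.size() > k) pq.poll(); // evict the current lowest score
      }
    }
    List<Hit> out = new ArrayList<>(pq);
    out.sort(Comparator.comparingDouble(Hit::score).reversed());
    return out;
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    // Two hypothetical slices scored concurrently.
    List<Callable<List<Hit>>> tasks = List.of(
        () -> List.of(new Hit(0, 1.5f), new Hit(3, 0.2f)),
        () -> List.of(new Hit(7, 2.5f), new Hit(9, 0.9f)));
    List<List<Hit>> results = new ArrayList<>();
    for (Future<List<Hit>> f : pool.invokeAll(tasks)) results.add(f.get());
    pool.shutdown();
    System.out.println(topK(results, 2)); // docs 7 and 0, best first
  }
}
```

The per-slice fixed setup cost (creating scorers per work unit) is exactly the overhead the thread worries about when slicing within a segment.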
[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap
[ https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758431#comment-16758431 ] Michael McCandless commented on LUCENE-8635: {quote}Better would be an attribute of {{FieldInfo}}, where we have {{put/getAttribute}}. Then {{FieldReader}} can inspect the {{FieldInfo}} and pass the appropriate {{On/OffHeapStore}} when creating its {{FST}}. What do you think? {quote} Hmm that's also an interesting approach to get per-field control. One can set these attributes in a custom {{FieldType}} when indexing documents, or maybe in a custom codec at write time (just subclassing e.g. {{Lucene80Codec}}), or at read time using a real (named) custom codec. So we would pick a specific string ({{FST_OFF_HEAP}} or something) and define that as a string constant which users could then use for setting the attribute? So ... maybe we have a default behavior w/ Adrien's cool idea, but then also allow the attribute to give per-field control? We should probably also by default (if the field attribute is not present) not do off-heap when the directory is not MMapDirectory? We haven't tested the other directory impls but I suspect they'd be quite a bit slower with off-heap FST? {quote}Given that reversing the index during write to make it forward reading didn't help the performance (in addition to it not being backward compatible), is the consensus to add exception for PK and directories other than mmap for offheap FST in [^ra.patch]? {quote} Yeah +1 to keep the two changes separated. 
> Lazy loading Lucene FST offheap using mmap > -- > > Key: LUCENE-8635 > URL: https://issues.apache.org/jira/browse/LUCENE-8635 > Project: Lucene - Core > Issue Type: New Feature > Components: core/FSTs > Environment: I used below setup for es_rally tests: > single node i3.xlarge running ES 6.5 > es_rally was running on another i3.xlarge instance >Reporter: Ankit Jain >Priority: Major > Attachments: fst-offheap-ra-rev.patch, fst-offheap-rev.patch, > offheap.patch, optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx > > > Currently, FST loads all the terms into heap memory during index open. This > causes frequent JVM OOM issues if the term size gets big. A better way of > doing this will be to lazily load FST using mmap. That ensures only the > required terms get loaded into memory. > > Lucene can expose API for providing list of fields to load terms offheap. I'm > planning to take following approach for this: > # Add a boolean property fstOffHeap in FieldInfo > # Pass list of offheap fields to lucene during index open (ALL can be > special keyword for loading ALL fields offheap) > # Initialize the fstOffHeap property during lucene index open > # FieldReader invokes default FST constructor or OffHeap constructor based > on fstOffHeap field > > I created a patch (that loads all fields offheap), did some benchmarks using > es_rally and results look good. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap
[ https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755374#comment-16755374 ] Michael McCandless commented on LUCENE-8635: Oooh I like that proposal [~jpountz]! > Lazy loading Lucene FST offheap using mmap > -- > > Key: LUCENE-8635 > URL: https://issues.apache.org/jira/browse/LUCENE-8635 > Project: Lucene - Core > Issue Type: New Feature > Components: core/FSTs > Environment: I used below setup for es_rally tests: > single node i3.xlarge running ES 6.5 > es_rally was running on another i3.xlarge instance >Reporter: Ankit Jain >Priority: Major > Attachments: fst-offheap-ra-rev.patch, offheap.patch, > optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx > > > Currently, FST loads all the terms into heap memory during index open. This > causes frequent JVM OOM issues if the term size gets big. A better way of > doing this will be to lazily load FST using mmap. That ensures only the > required terms get loaded into memory. > > Lucene can expose API for providing list of fields to load terms offheap. I'm > planning to take following approach for this: > # Add a boolean property fstOffHeap in FieldInfo > # Pass list of offheap fields to lucene during index open (ALL can be > special keyword for loading ALL fields offheap) > # Initialize the fstOffHeap property during lucene index open > # FieldReader invokes default FST constructor or OffHeap constructor based > on fstOffHeap field > > I created a patch (that loads all fields offheap), did some benchmarks using > es_rally and results look good. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap
[ https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755344#comment-16755344 ] Michael McCandless commented on LUCENE-8635: OK net/net it looks like there is a small performance impact for some queries, and a biggish (~7-8%) impact for {{PKLookup}}. But this is a nice option to have for users who are heap constrained by the FSTs, so I wonder how we could add this as an option, off by default? E.g. users might want their {{id}} field to store the FST in heap (like today), but all other fields off-heap. There is no index format change required here, which is nice, but Lucene doesn't make it easy to have read-time codec behavior changes, so maybe the solution is that at write-time we add an option e.g. to {{BlockTreeTermsWriter}} and it stores this in the index and then at read-time {{BlockTreeTermsReader}} checks that option and loads the FST accordingly? Then users could customize their codecs to achieve this. Or I suppose we could add a global system property, e.g. our default stored fields writer has a property to turn on/off bulk merge, but I think we are trying not to use Java properties going forward? Can anyone think of any other approaches to make this option possible?
[jira] [Commented] (LUCENE-8653) Reverse FST storage so it can be read forward
[ https://issues.apache.org/jira/browse/LUCENE-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748985#comment-16748985 ] Michael McCandless commented on LUCENE-8653: Impressive how simple this was! I think it's simpler to think about, reading the {{byte[]}} in forward order, and it ought to be a bit more cache friendly. I agree jumping between FST nodes is very random access, but e.g. at a given node as we scan the arcs looking for a match that would become sequential byte reads with this change. Curious the impact is neutral, but maybe if we combine this with LUCENE-8635 we can measure an impact? > Reverse FST storage so it can be read forward > - > > Key: LUCENE-8653 > URL: https://issues.apache.org/jira/browse/LUCENE-8653 > Project: Lucene - Core > Issue Type: Improvement > Components: core/FSTs >Reporter: Mike Sokolov >Priority: Major > Attachments: fst-reverse.patch > > > Discussion of keeping FST off-heap led to the idea of ensuring that FST's can > be read forward in order to be more cache-friendly and align better with > standard I/O practice. Today FSTs are read in reverse and this leads to some > awkwardness, and you can't use standard readers so the code can be confusing > to work with. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
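One way to picture "reverse the storage so reads go forward": today's reader walks the `byte[]` backwards, while writing the bytes reversed lets the same logical sequence be consumed with a plain forward cursor, which is friendlier to CPU caches and OS read-ahead. A toy sketch of the equivalence, not the actual FST format:

```java
import java.util.Arrays;

// Toy illustration of LUCENE-8653's idea: if the writer emits bytes in
// reverse, the reader recovers the same logical sequence with a plain
// forward cursor instead of stepping backwards through the buffer.
public class ReverseDemo {
  static byte[] reverse(byte[] in) {
    byte[] out = new byte[in.length];
    for (int i = 0; i < in.length; i++) out[i] = in[in.length - 1 - i];
    return out;
  }

  // Old style: walk backwards from the end of the stored buffer.
  static byte[] readBackward(byte[] stored) {
    byte[] out = new byte[stored.length];
    for (int i = 0; i < stored.length; i++) out[i] = stored[stored.length - 1 - i];
    return out;
  }

  // New style: un-reverse once at write time, then read sequentially.
  static byte[] readForward(byte[] stored) {
    return Arrays.copyOf(stored, stored.length);
  }

  public static void main(String[] args) {
    byte[] logical = {1, 2, 3, 4};    // e.g. the bytes of one FST arc, in order
    byte[] stored = reverse(logical); // how the FST stores them today
    System.out.println(Arrays.equals(readBackward(stored), logical));
    System.out.println(Arrays.equals(readForward(reverse(stored)), logical));
  }
}
```

Both strategies recover the same logical bytes; only the access pattern (backward vs sequential forward) differs, which is where the hoped-for cache friendliness comes from.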
[jira] [Commented] (LUCENE-8618) MMapDirectory's read ahead on random-access files might trash the OS cache
[ https://issues.apache.org/jira/browse/LUCENE-8618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16747963#comment-16747963 ] Michael McCandless commented on LUCENE-8618: Was the index cold(ish) in this use case? I.e. the 2 MB read-ahead was consuming valuable IO resources that were better spent on the other IOPs actually needed for the use case. > MMapDirectory's read ahead on random-access files might trash the OS cache > -- > > Key: LUCENE-8618 > URL: https://issues.apache.org/jira/browse/LUCENE-8618 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > > At Elastic we were reported a case which runs significantly slower with > MMapDirectory than with NIOFSDirectory. After a long analysis, we discovered > that it had to do with MMapDirectory's read ahead of 2MB, which doesn't help > and even trashes the OS cache on stored fields and term vectors files which > have a fully random access pattern (except at merge time). > The particular use-case that exhibits the slow-down is performing updates, > ie. we first look up a document based on its id, fetch stored fields, compute > new stored fields (eg. after adding or changing the value of a field) and add > the document back to the index. We were able to reproduce the workload that > this Elasticsearch user described and measured a median throughput of 3600 > updates/s with MMapDirectory and 5000 updates/s with NIOFSDirectory. It even > goes up to 5600 updates/s if you configure a FileSwitchDirectory to use > MMapDirectory for the terms dictionary and NIOFSDirectory for stored fields > (postings files are not relevant here since postings are inlined in the terms > dict when docFreq=1 and indexOptions=DOCS). > While it is possible to work around this issue on top of Lucene, maybe this > is something that we could improve directly in Lucene, eg. 
by propagating > information about the expected access pattern and avoiding mmap on files that > have a fully random access pattern (until Java exposes madvise in some way)? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
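The `FileSwitchDirectory` workaround described in the issue boils down to routing files by extension: randomly accessed files avoid mmap's read-ahead, sequentially read files keep it. A sketch of just the routing decision; the extension set (`.fdt` stored fields, `.tvd` term vectors) is an assumption for illustration, not an exhaustive list:

```java
import java.util.Set;

// Sketch of a FileSwitchDirectory-style routing rule: send files with a
// fully random access pattern to NIO reads (no 2MB read-ahead), keep the
// rest on mmap. Extension set is assumed for illustration.
public class DirSwitch {
  private static final Set<String> RANDOM_ACCESS = Set.of("fdt", "tvd");

  static String strategyFor(String fileName) {
    int dot = fileName.lastIndexOf('.');
    String ext = dot < 0 ? "" : fileName.substring(dot + 1);
    return RANDOM_ACCESS.contains(ext) ? "nio" : "mmap";
  }

  public static void main(String[] args) {
    System.out.println(strategyFor("_0.fdt")); // nio: random access pattern
    System.out.println(strategyFor("_0.tim")); // mmap: terms dictionary
  }
}
```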
[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap
[ https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745253#comment-16745253 ] Michael McCandless commented on LUCENE-8635: OK thanks [~sokolov]. I'll try to also run bench on wikibig and report back. I think doing a single method call instead of the two (seek + read) via {{RandomAccessInput}} must be helping. {quote}The thing that makes me want to be careful here is that access to the terms index is very random, so things might degrade badly if the OS cache doesn't hold the whole terms index in memory. {quote} I think net/net we are already relying on OS to do the right thing here. As things stand today, the OS could also swap out the heap pages that hold the FST's {{byte[]}} depending on its swappiness (on Linux). {quote}I'm not super familiar with the FST internals, I wonder whether there are changes that we could make to it so that it would be more disk-friendly, eg. by seeking backward as little as possible when looking up a key? {quote} We used to have a {{pack}} method in FST that would 1) try to further compress the {{byte[]}} size by moving nodes "closer" to the nodes that transitioned to them, and 2) reverse the bytes. But we removed that method because it added complexity and nobody was really using it and sometimes it even made the FST bigger! Maybe we could bring the method back, but only part 2) of it, and always call it at the end of building an FST? That should be simpler code (without part 1), and should achieve sequential reads of at least the bytes to decode a single transition; maybe it gives a performance jump independent of this change? But I think we really should explore that independently of this issue ... I think as long as additional performance tests show only these smallish impacts to real queries we should just make the change across the board for terms dictionary index? 
[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap
[ https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744538#comment-16744538 ] Michael McCandless commented on LUCENE-8635: Thanks [~sokolov] – those numbers look quite a bit better! Though, your QPSs are kinda high overall – how many Wikipedia docs were in your index? I do wonder if we simply reversed the FST's byte[] when we create it, what impact that'd have on lookup performance. Hmm even if we did that, we'd still have to {{readBytes}} one byte at a time since {{RandomAccessInput}} does not have a {{readBytes}} method? But ... maybe {{IndexInput}} would give good performance in that case? We should probably pursue that separately though...
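The missing-bulk-read point can be made concrete: if a positioned-read API only exposes a single-byte read, any bulk read degenerates into a loop of per-byte calls. A hedged sketch; `RandomishInput` mimics the shape of Lucene's `RandomAccessInput` but is not Lucene code:

```java
import java.util.Arrays;

// Sketch of the API gap discussed above: with only a positioned readByte,
// bulk reads become a per-byte loop -- the overhead the comment worries
// about. Names mimic Lucene's RandomAccessInput but this is not Lucene code.
public class RandomAccessSketch {
  interface RandomishInput {
    byte readByte(long pos);

    // Fallback bulk read built from single-byte calls; a real readBytes
    // could instead be one contiguous copy from the backing storage.
    default void readBytes(long pos, byte[] dst, int off, int len) {
      for (int i = 0; i < len; i++) {
        dst[off + i] = readByte(pos + i);
      }
    }
  }

  public static void main(String[] args) {
    byte[] backing = {10, 20, 30, 40, 50};
    RandomishInput in = pos -> backing[(int) pos];
    byte[] dst = new byte[3];
    in.readBytes(1, dst, 0, 3);
    System.out.println(Arrays.toString(dst)); // [20, 30, 40]
  }
}
```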
[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap
[ https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743137#comment-16743137 ] Michael McCandless commented on LUCENE-8635: Thanks for testing [~sokolov] – the results make sense: the most terms dictionary intensive queries are impacted the most, with {{PKLookup}} being heavily impacted since that's just purely exercising the terms dictionary with no postings visited. Fuzzy queries, and then queries matching few hits (conjunctions with low/medium freq terms) also spend relatively more time in the terms dictionary ... So net/net it looks like we should not make this the default, but expose it somehow as an option for those use cases that don't want to dedicate heap memory to storing FSTs?
[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap
[ https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740764#comment-16740764 ] Michael McCandless commented on LUCENE-8635: Also, have you confirmed that all tests pass when you switch to off heap FST storage always?
[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap
[ https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740757#comment-16740757 ] Michael McCandless commented on LUCENE-8635: Wow, this is impressive! Surprising how small the change was – basically opening up the FST BytesStore API a bit so that we could have an impl that wraps an {{IndexInput}} (reading backwards) instead of a {{byte[]}}. Can you copy/paste the rally results out of Excel here? I'm curious what search-time impact you're seeing. If it's not too much of an impact, maybe we should consider just moving FSTs off-heap in the default codec? We've done similar things recently for Lucene, e.g. moving norms off-heap. I'll run Lucene's wikipedia benchmarks to measure the impact from our standard benchmarks (the nightly Lucene benchmarks).
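The "reading backwards" idea mentioned in the comment above can be sketched with a toy reader that consumes a byte buffer in reverse, the way FST arcs are laid out. This is purely illustrative (the class and its methods are invented for this sketch, not Lucene's BytesStore/IndexInput API):

```java
// A minimal sketch of a reader over a byte buffer that is consumed
// backwards, mimicking how an FST stored behind an IndexInput could be
// traversed in reverse instead of through an on-heap byte[] BytesStore.
// Names are illustrative, not Lucene's actual API.
class ReverseBytesReaderSketch {
    private final byte[] bytes;
    private int pos; // index of the next byte to read, moving toward 0

    ReverseBytesReaderSketch(byte[] bytes, int startPos) {
        this.bytes = bytes;
        this.pos = startPos;
    }

    byte readByte() {
        return bytes[pos--]; // read, then step backwards
    }

    void skipBytes(int count) {
        pos -= count; // skipping also moves toward the start of the buffer
    }
}
```

An off-heap variant would read from a memory-mapped file at position `pos` instead of indexing into a heap array; the traversal order is the same.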
[jira] [Commented] (LUCENE-8601) Adding attributes to IndexFieldType
[ https://issues.apache.org/jira/browse/LUCENE-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734103#comment-16734103 ] Michael McCandless commented on LUCENE-8601: Thanks [~muralikpbhat] – I'll review and push to 7.x! > Adding attributes to IndexFieldType > --- > > Key: LUCENE-8601 > URL: https://issues.apache.org/jira/browse/LUCENE-8601 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index > Affects Versions: 7.5 > Reporter: Murali Krishna P > Priority: Major > Attachments: 7x_LUCENE-8601.06.patch, LUCENE-8601.01.patch, LUCENE-8601.02.patch, LUCENE-8601.03.patch, LUCENE-8601.04.patch, LUCENE-8601.05.patch, LUCENE-8601.06.patch, LUCENE-8601.patch > > > Today, we can write a custom Field using a custom IndexFieldType, but when the DefaultIndexingChain converts [IndexFieldType to FieldInfo|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L662], only a few key pieces of information, such as indexing options and doc value type, are retained. The [Codec gets the FieldInfo|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/DocValuesConsumer.java#L90], but not the type details. > > FieldInfo has support for ['attributes'|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/FieldInfo.java#L47], and it would be great if we could add 'attributes' to IndexFieldType as well and copy them to FieldInfo's 'attributes'. > > This would allow someone to write a custom codec (extending a doc values format, for example) for only the 'special field' that they want, and delegate the rest of the fields to the default codec.
[jira] [Commented] (LUCENE-8601) Adding attributes to IndexFieldType
[ https://issues.apache.org/jira/browse/LUCENE-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733641#comment-16733641 ] Michael McCandless commented on LUCENE-8601: Hi [~muralikpbhat], I pushed the change to master, thanks! But the {{git cherry-pick}} back to 7.x was not clean – could you fix up the patch to apply to 7.x as well? Also, the test case uses a FieldInfos API that was never back-ported to 7.x ({{getMergedFieldInfos}}). Also, staring at the code shortly after I pushed, I noticed that the field type's attributes will be saved into FieldInfo the first time that field is seen for a given segment, but on subsequent occurrences it looks like we will fail to copy the attributes again. Can you also add a test case exposing this bug, and then fix it? We can do that in a follow-on issue ... thanks!
[jira] [Commented] (LUCENE-8621) Move LatLonShape out of sandbox
[ https://issues.apache.org/jira/browse/LUCENE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731697#comment-16731697 ] Michael McCandless commented on LUCENE-8621: +1 > Move LatLonShape out of sandbox > --- > > Key: LUCENE-8621 > URL: https://issues.apache.org/jira/browse/LUCENE-8621 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Adrien Grand > Priority: Minor > > LatLonShape has matured a lot over the last few months; I'd like to start thinking about moving it out of sandbox so that it doesn't stay there for too long, like what happened to LatLonPoint. I am pretty happy with the current encoding. To my knowledge, we might just need to make a minor modification because of LUCENE-8620.
[jira] [Commented] (LUCENE-8601) Adding attributes to IndexFieldType
[ https://issues.apache.org/jira/browse/LUCENE-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731696#comment-16731696 ] Michael McCandless commented on LUCENE-8601: Thanks, I will review and push soon!
[jira] [Commented] (LUCENE-8601) Adding attributes to IndexFieldType
[ https://issues.apache.org/jira/browse/LUCENE-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731288#comment-16731288 ] Michael McCandless commented on LUCENE-8601: Ahh OK thanks [~muralikpbhat]; that makes sense, so let's leave the assertion out.
[jira] [Commented] (LUCENE-8601) Adding attributes to IndexFieldType
[ https://issues.apache.org/jira/browse/LUCENE-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730210#comment-16730210 ] Michael McCandless commented on LUCENE-8601: Thanks [~muralikpbhat]! Maybe add javadocs for the {{ignoreCurrentFormat}} parameter? Can you use a multi-line {{if}} statement instead of the ternary operator? I think the changes to {{PerFieldPostingsFormat}} are OK, except instead of removing the check that there was no format there and blindly overwriting it, can you change it to check that either it wasn't there (what it checks now) or, if it is there, that the value of the attributes matches what that postings format wants to write? No need to initialize class members with {{= null}}; that's already the default in Java. In {{DefaultIndexingChain}}, can you use a local variable for {{fieldType.getAttributes()}} in the two places where you reference it?
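The "don't blindly overwrite, but allow a matching re-write" check suggested in the review comment above can be sketched like this. The class and method names are invented for illustration; this is not the actual PerFieldPostingsFormat code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the review suggestion: instead of blindly overwriting an
// attribute that is already present, allow the write only when the new
// value matches what is already stored, and fail loudly otherwise.
// Purely illustrative; not Lucene's actual implementation.
class AttributeMapSketch {
    private final Map<String, String> attributes = new HashMap<>();

    void putOrCheck(String key, String value) {
        String previous = attributes.putIfAbsent(key, value);
        if (previous != null && previous.equals(value) == false) {
            throw new IllegalStateException(
                "attribute \"" + key + "\" already set to \"" + previous
                + "\", refusing to overwrite with \"" + value + "\"");
        }
    }

    String get(String key) {
        return attributes.get(key);
    }
}
```

Writing the same value twice is a no-op; writing a conflicting value surfaces the bug immediately instead of silently clobbering the earlier format's attributes.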
[jira] [Commented] (LUCENE-8601) Adding attributes to IndexFieldType
[ https://issues.apache.org/jira/browse/LUCENE-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725297#comment-16725297 ] Michael McCandless commented on LUCENE-8601: Hmm, I'm concerned that segment merging may not preserve the attributes. [~muralikpbhat], could you please add a test case that forces merging? E.g. index one document with attributes, commit (so it writes a segment), index another without attributes, commit, and confirm the attributes survived? Can you also update the javadocs to state that if you try to index conflicting attributes, the behavior is undefined (i.e. which attribute wins is undefined)?
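The invariant the requested merge test would check can be illustrated with a toy merge of two attribute maps: an attribute present in either input segment must survive in the merged result. This is a simplified model of the concern, not Lucene's FieldInfos merge code:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the merge concern in the comment above: when two segments'
// field infos are merged, an attribute seen in only one segment must still
// be present afterwards. Illustrative sketch, not Lucene's merge code.
class AttributeMergeSketch {
    static Map<String, String> merge(Map<String, String> a, Map<String, String> b) {
        Map<String, String> merged = new HashMap<>(a);
        // attributes from the second segment fill in anything missing;
        // on conflict the first segment's value wins (which value wins
        // for conflicting attributes is undefined, per the javadoc note)
        b.forEach(merged::putIfAbsent);
        return merged;
    }
}
```

A real test would index one document with attributes, commit, index one without, commit, force-merge, and assert the attributes are still visible on the merged segment's FieldInfo.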