ConjunctionScorer.doNext() overstays?
Due to the odd behaviour of a custom Scorer of mine I discovered ConjunctionScorer.doNext() could loop indefinitely. It does not bail out as soon as any of the scorer.advance() calls it makes reports back NO_MORE_DOCS. Is there not a performance optimisation to be gained in exiting as soon as this happens?

At this stage I cannot see any point in continuing to advance the other scorers - a quick look at TermScorer suggests that any questionable calls made by ConjunctionScorer to advance to NO_MORE_DOCS receive no special treatment, and disk will be hit as a consequence.

I added an extra condition to the while loop on the 3.5 source:

    while ((doc != NO_MORE_DOCS) && ((firstScorer = scorers[first]).docID() < doc)) {

and the JUnit tests passed. I haven't been able to benchmark performance improvements but it looks like it would be sensible to make the change anyway.

Cheers,
Mark
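For context, a minimal sketch of the 3.5-era round-robin loop with the proposed guard in place - approximate, reconstructed from memory rather than a verbatim copy of ConjunctionScorer:

    private int doNext() throws IOException {
      int first = 0;
      int doc = scorers[scorers.length - 1].docID();
      Scorer firstScorer;
      // Proposed extra guard: stop cycling the scorers once any one of them
      // is exhausted, rather than asking the rest to advance(NO_MORE_DOCS).
      while ((doc != NO_MORE_DOCS)
          && ((firstScorer = scorers[first]).docID() < doc)) {
        doc = firstScorer.advance(doc);
        first = first == scorers.length - 1 ? 0 : first + 1;
      }
      return doc;
    }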
Re: ConjunctionScorer.doNext() overstays?
I got round to some benchmarking of this change on Wikipedia content which shows a small improvement: http://goo.gl/60wJG

Aside from the small performance gain to be had, it just feels more logical if ConjunctionScorer does not issue sub-scorers with a request to advance to NO_MORE_DOCS.
Re: ConjunctionScorer.doNext() overstays?
I would have assumed the many int comparisons would cost less than the superfluous disk accesses? (I bow to your considerable experience in this area!) What is the worst-case scenario on added disk reads? Could it be as bad as numberOfSegments x numberOfOtherScorers before the query winds up?

On the index I tried, it looked like an improvement - the spreadsheet I linked to has the source for the benchmark on a second worksheet if you want to give it a whirl on a different dataset.

- Original Message -
From: Michael McCandless
To: dev@lucene.apache.org; mark harwood
Sent: Thursday, 1 March 2012, 13:31
Subject: Re: ConjunctionScorer.doNext() overstays?

Hmm, the tradeoff is an added per-hit check (doc != NO_MORE_DOCS), vs the one-time cost at the end of calling advance(NO_MORE_DOCS) for each sub-clause? I think in general this isn't a good tradeoff? Ie what about the case where we AND high-freq, and similarly freq'd, terms together? Then, the per-hit check will at some point dominate?

It's valid to pass NO_MORE_DOCS to DocsEnum.advance.

Mike McCandless
http://blog.mikemccandless.com
Re: ConjunctionScorer.doNext() overstays?
Fair points. I've tried several sized indexes and blends of query term frequencies now, and the results swing only marginally between the two implementations. Sometimes the "exiting early" logic is marginally faster and other times marginally slower. Using a larger index seemed to reduce the improvement I had seen in my initial results. So overall, not a clear improvement and not worth bothering with because, as you suggest, various disk caching strategies probably mitigate the cost of the added reads.

Your comments re the added int comparison cost in that "hot" loop made me think that the abstract DocIdSetIterator.docID() method call could be questioned on the same basis. It looks like all DocIdSetIterator subclasses maintain a doc variable mutated elsewhere in advance() and next() calls, and docID() is meant to be idempotent, so presumably a shared variable in the base class could avoid a docID() method invocation? (A sketch of the idea follows at the end of this message.) Anyhoo, the profiler did not show that method up as any sort of hotspot so I don't think it's an issue.

Thanks, Mike.

- Original Message -
From: Michael McCandless
To: dev@lucene.apache.org; mark harwood
Sent: Thursday, 1 March 2012, 14:18
Subject: Re: ConjunctionScorer.doNext() overstays?

On Thu, Mar 1, 2012 at 8:49 AM, mark harwood wrote:
> I would have assumed the many int comparisons would cost less than the
> superfluous disk accesses? (I bow to your considerable experience in this area!)
> What is the worst-case scenario on added disk reads? Could it be as bad
> as numberOfSegments x numberOfOtherScorers before the query winds up?

Well, it depends -- the disk access is a one-time thing but the added check is per-hit. At some point it'll cross over...

I think likely the advance(NO_MORE_DOCS) will not usually hit disk: our skipper impl fully pre-buffers (in RAM) the top skip lists I think? Even if we do go to disk it's likely the OS pre-cached those bytes in its IO buffer.

> On the index I tried, it looked like an improvement - the spreadsheet I
> linked to has the source for the benchmark on a second worksheet if you want
> to give it a whirl on a different dataset.

Maybe try it on a more balanced case? Ie, N high-freq terms whose freq is "close-ish"? And on slow queries (I think the results in your spreadsheet are very fast queries right? The slowest one was ~0.95 msec per query, if I'm reading it right?). In general I think not slowing down the worst-case queries is much more important than speeding up the super-fast queries.

Mike
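A minimal sketch of that idea (purely hypothetical - not how any Lucene release actually lays the class out): hoist the doc cursor into the abstract base so docID() is a plain field read rather than a virtual call into each subclass.

    public abstract class DocIdSetIterator {
      public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

      protected int doc = -1; // subclasses assign this in nextDoc()/advance()

      // Idempotent, and now a final field read rather than an abstract
      // method each subclass must implement.
      public final int docID() {
        return doc;
      }

      public abstract int nextDoc() throws IOException;

      public abstract int advance(int target) throws IOException;
    }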
Re: ConjunctionScorer.doNext() overstays?
> Ideally, consumers of DISI should hold onto the int docID returned
> from next/advance and use that... (ie, don't call docID() again,
> unless it's too hard to hold onto the returned doc).

Yes, I remember raising that way back when: https://issues.apache.org/jira/browse/LUCENE-584?focusedCommentId=12564415&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12564415

Back then Mike B raised the issue of backwards compatibility, so I don't know if the 4.0 release presents the opportunity to revisit that idea.
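A minimal sketch of the consumption pattern Mike describes - hold on to the int returned by nextDoc() rather than calling docID() again (process() stands in for whatever the caller does per hit):

    void consume(DocIdSetIterator disi) throws IOException {
      int doc = disi.nextDoc();
      while (doc != DocIdSetIterator.NO_MORE_DOCS) {
        process(doc); // hypothetical per-hit work - no second docID() call
        doc = disi.nextDoc();
      }
    }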
Re: GSOC 2012?
>> Does anyone have any ideas?

A framework for match metadata?

Similar to the way tokenization was changed to allow tokenizers to enrich a stream of tokens with arbitrary "attributes", Scorers could provide "MatchAttributes" to supply arbitrary metadata about the stream of matches they produce. The same model is used - callers decide in advance which attribute decorations they want to consume, and Scorers modify a singleton object which can be cloned if multiple attributes need to be retained by the caller. This helps support highlighting and explain, and enables communication of added information between query objects in the tree (a rough sketch follows at the end of this message).

LUCENE-1999 was an example of a horrible work-around where additional match information that was required was smuggled through by bit-twiddling the score - this is because score is the only bit of match context we currently pass in Lucene APIs.

Cheers
Mark

From: Robert Muir
To: dev@lucene.apache.org
Sent: Friday, 2 March 2012, 10:30
Subject: GSOC 2012?

Hello,

I was asked by a student if we are participating in GSOC this year. I hope the answer is yes?

If we are planning to, I think it would be good if we came up with a list on the wiki of potential tasks. Does anyone have any ideas?

One suggested idea I had (similar to LUCENE-2959 last year) would be to add a flexible query expansion framework.

--
lucidimagination.com
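A minimal sketch of how the proposal might look, mirroring the analysis attributes model - every name here is hypothetical, none of this exists in Lucene:

    // Marker for metadata a Scorer can expose about its current match.
    public interface MatchAttribute {
    }

    // One possible decoration: proximity detail behind the current match.
    public interface ProximityAttribute extends MatchAttribute {
      int minDistance();
    }

    // Caller side, mirroring TokenStream.addAttribute(): declare interest
    // up front, then read the singleton as the scorer advances. The scorer
    // mutates the same instance per match; callers clone it if they need to
    // retain values.
    void collect(Scorer scorer) throws IOException {
      ProximityAttribute prox =
          scorer.addAttribute(ProximityAttribute.class); // hypothetical API
      int doc;
      while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        record(doc, prox.minDistance()); // hypothetical sink
      }
    }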
Proposal - a high performance Key-Value store based on Lucene APIs/concepts
I've been spending quite a bit of time recently benchmarking various Key-Value stores for a demanding project and have been largely disappointed with the results. However, I have developed a promising implementation based on these concepts: http://www.slideshare.net/MarkHarwood/lucene-kvstore

The code needs some packaging before I can release it but the slide deck should give a good overview of the design.

Is this something that is likely to be of interest as a contrib module here? I appreciate this is a departure from the regular search focus but it builds on some common ground in Lucene core and may have some applications here.

Cheers,
Mark
Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts
> Mark, can you share more on which K-V (NoSQL) stores you've been
> benchmarking and what the results have been?

Mongo, Cassandra, Krati, BDB, a Java version of Bitcask, Lucene, MySQL.

I was interested in benchmarking the single-server stores rather than a distributed setup, because your choice of store could be plugged into the likes of Voldemort for scale-out. The design is similar to the Bitcask paper but keeps only hashes of keys in RAM, not the full key. My implementation was the only store that didn't degrade noticeably as you get into tens of millions of keys in the store.

> Did you try all the well known ones?
> http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
>
> -- J
Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts
> Random question: Do you basically end up with something very similar to
> LevelDB that many people were talking about a few weeks ago?

I haven't looked at LevelDB because I was concentrating on Java implementations. Riak's Bitcask is the most similar in principle, but I didn't like the idea of holding full keys in RAM. See http://downloads.basho.com/papers/bitcask-intro.pdf
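A minimal sketch of the "hashes of keys in RAM" idea under an assumed design (my illustration, not the actual store's code): RAM maps a key's hash to candidate file offsets, and the full key is read from disk only to verify a match, so a hash collision costs one extra read rather than a wrong answer. RecordStore is a hypothetical reader over the on-disk log.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    interface RecordStore {
      String readKeyAt(long offset) throws IOException; // hypothetical disk read
    }

    class HashKeyIndex {
      // int hash -> offsets of on-disk records whose key produced that hash
      private final Map<Integer, List<Long>> hashToOffsets =
          new HashMap<Integer, List<Long>>();

      void put(String key, long recordOffset) {
        List<Long> offsets = hashToOffsets.get(key.hashCode());
        if (offsets == null) {
          offsets = new ArrayList<Long>(1);
          hashToOffsets.put(key.hashCode(), offsets);
        }
        offsets.add(recordOffset);
      }

      long findOffset(String key, RecordStore store) throws IOException {
        List<Long> candidates = hashToOffsets.get(key.hashCode());
        if (candidates == null) {
          return -1; // definitely absent
        }
        for (long offset : candidates) {
          if (key.equals(store.readKeyAt(offset))) { // verify against disk
            return offset;
          }
        }
        return -1; // hash collision only - this key was never stored
      }
    }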
Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts
OK, I have some code and benchmarks for this solution up on a Google Code project here: http://code.google.com/p/graphdb-load-tester/

The project exists to address the performance challenges I have encountered when dealing with large graphs. It uses all of the Wikipedia links as a test dataset and a choice of graph databases (most of which use Lucene, BTW). The test data is essentially 130 million edges representing links between pages, e.g. Communism->Russia.

To load the data, all of the graph databases have to translate user-defined keys like "Russia" into an internally-generated node ID using a service that looks like this:

    interface KeyService {
        // Returns existing node id or -1 if it is not already in the store
        public long getGraphNodeId(String udk);

        // Adds a new record - the assumption is the client has checked the
        // user-defined key (udk) is not stored already using getGraphNodeId
        public void put(String udk, long graphNodeId);
    }

This is a challenge on a dataset of this size. I tried using a Lucene-based implementation for this service with the following optimisations (a rough sketch of how they layer together follows below):

1) a Bloom filter to quickly "know what we don't know"
2) an LRU cache to hold on to commonly referenced vertices, e.g. the Wikipedia article for "United States"
3) a hashmap representing the unflushed state of Lucene's IndexWriter, to avoid the need for excessive flushing with an NRT reader etc.

The search/write performance showed the familiar saw-toothing as the Lucene index grew in size and merge operations kicked in. The KVStore implementation I wrote attempts to tackle this problem using a fundamentally different form of index. The results from the KVStore runs show it was twice as fast as this Lucene solution and maintains constant performance without the saw-toothing effect.

Benchmark figures are here: http://goo.gl/VQ027
The KVStore source code is here: http://goo.gl/ovkop
and the Lucene implementation I compare against is also in the project.

Cheers
Mark
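A minimal sketch (my reconstruction, not the project's code) of how those three optimisations can layer in front of the index lookup - BloomFilter and the LRU map are assumed helper types:

    class LayeredKeyService implements KeyService {
      private final BloomFilter bloomFilter;     // assumed helper type
      private final Map<String, Long> lruCache;  // assumed LRU map impl
      private final Map<String, Long> unflushedKeys = new HashMap<String, Long>();
      private final KeyService indexedLookup;    // the Lucene/KVStore search

      LayeredKeyService(BloomFilter bloomFilter, Map<String, Long> lruCache,
                        KeyService indexedLookup) {
        this.bloomFilter = bloomFilter;
        this.lruCache = lruCache;
        this.indexedLookup = indexedLookup;
      }

      public long getGraphNodeId(String udk) {
        // 1) Quickly "know what we don't know": most probes on a growing
        //    graph are for keys that were never stored.
        if (!bloomFilter.mightContain(udk)) {
          return -1;
        }
        // 2) Hot vertices (e.g. "United States") answered from the LRU cache.
        Long cached = lruCache.get(udk);
        if (cached != null) {
          return cached.longValue();
        }
        // 3) Keys written but not yet flushed are invisible to the index
        //    reader, so consult the unflushed state before searching.
        Long pending = unflushedKeys.get(udk);
        if (pending != null) {
          return pending.longValue();
        }
        return indexedLookup.getGraphNodeId(udk);
      }

      public void put(String udk, long graphNodeId) {
        bloomFilter.add(udk);
        unflushedKeys.put(udk, graphNodeId); // cleared when the index flushes
        indexedLookup.put(udk, graphNodeId);
      }
    }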
Continuous stream indexing and time-based segment management
There are a number of scenarios where Lucene might be used to index a fixed time range on a continuous stream of data, e.g. a news feed. In these scenarios I imagine the following facilities would be useful:

a) A MergePolicy that organized content into segments on the basis of increasing time units, e.g. 5 min -> 10 min -> 1 hour -> 1 day
b) The ability to drop entire segments, e.g. the day-level segment from exactly a week ago
c) Various new analysis functions comparing term frequencies across time, e.g. discovery of "trending" topics

I can see that a) could be implemented using a custom MergePolicy and c) can be done via existing APIs, but I'm not sure if there is a way to simply drop entire segments currently?

Anyone else had thoughts in this area?

Cheers
Mark
Re: Continuous stream indexing and time-based segment management
> you can do that by subclassing IW and call some package private APIs /
> members. We can certainly make that easier but I personally don't want
> to open this as a public API. I can certainly imagine having a
> protected API that allows dropping an entire segment.

To date I have used separate physical indexes with a MultiReader to combine them, then dropping the outdated indexes (sketched below). At least this has the benefit that a custom MergePolicy is not required to keep content from the different dates segregated.

Where I saw the potential is when looking at S4 or Esper stream-processing technologies when they try to count things in time windows. It struck me that careful organisation of Lucene segments along time units could provide an efficient means of accessing and comparing counts of many things over time. It looked like the "Hello World" example in S4 for counting top Twitter topics instantiated a Java object per unique topic String which was then responsible for maintaining counts on things - this seems a fairly inefficient way of modelling things.

>> If you are willing/able to close the IndexWriter, it's easy to drop segments
>> by reading the SegmentInfos, editing, and writing back.

My assumption was that ultimately that's what it comes down to - I just wonder if this is likely to be a common requirement, deserving of a supported API.

>> c) Various new analysis functions comparing term frequencies across time
>> e.g. discovery of "trending" topics.

> I had some ideas to add statistics to DocValues that get created during
> index time. You can already do that and expose it via Attributes; maybe we
> can add some API to DocValues you can hook into so that you don't need to
> write your own DV impl.
>
> simon
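For illustration, a minimal sketch of the MultiReader approach described above, assuming one physical index per time bucket (openReadersForWindow is a hypothetical helper):

    // One physical index per time bucket; expiring old content means dropping
    // a whole reader (and deleting its directory), not deleting documents.
    IndexReader[] buckets = openReadersForWindow(7); // hypothetical helper:
                                                     // one reader per day
    MultiReader window = new MultiReader(buckets);
    IndexSearcher searcher = new IndexSearcher(window);
    // When the window rolls forward: close the oldest reader, delete its
    // directory, open a reader over the new bucket and rebuild the MultiReader.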
Re: Welcome Greg Bowyer
Good to have you aboard, Greg!

- Original Message -
From: Erick Erickson
To: dev@lucene.apache.org
Sent: Thursday, 21 June 2012, 11:56
Subject: Welcome Greg Bowyer

I'm pleased to announce that Greg Bowyer has been added as a Lucene/Solr committer.

Greg: It's a tradition that you reply with a brief bio. Your SVN access should be set up and ready to go.

Congratulations!

Erick Erickson
Adding another dimension to Lucene searches
I have been working on a hierarchical search capability for a while now and wanted to see if there was general interest in adopting some of the thinking into Lucene.

The idea needs a little explanation so I've put some slides up here to kick things off: http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

Cheers
Mark
Re: Adding another dimension to Lucene searches
OK, seems like there is some interest. I'll work on packaging the code/unit tests/demos and make it available.

> matching ids ... but I didn't quite catch from the slides how you encode
> the parent-child link... is it just "the next docs are sub-documents
> until the next parent doc"?

Yes - using physical proximity avoids any kind of costly look-ups and allows efficient streaming/skipTo logic to work as per usual (a sketch of the layout is below). The downside is the need to maintain sequences of related docs in the same segment - something Lucene currently doesn't make easy with its limited control over when segments are flushed. I suspect we'll need some discussion on how best to support this.

Another dependency is that Lucene maintains the sequencing of documents when merging segments together - this is something I think we can rely on currently (please correct me if I'm wrong) but I would like to formalise this with a JUnit test or some other form of commitment which guarantees this state of affairs.

Cheers
Mark

On 8 May 2010, at 08:32, Andrzej Bialecki wrote:

> Very cool stuff. If I understand the design correctly, the cost of the
> query is roughly the same as constructing a Filter Query from the parent
> query, and then executing the child query with this filter. You probably
> use childScorer.skipTo(nextParentId) to avoid actually traversing all
> matching ids ... but I didn't quite catch from the slides how you encode
> the parent-child link... is it just "the next docs are sub-documents
> until the next parent doc"? or is it a field in the children that points
> to a unique id field of the parent?
>
> --
> Best regards,
> Andrzej Bialecki
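A minimal sketch of that "children first, parent last" layout, shown with the block-write call that later formalised it, IndexWriter.addDocuments() - makeChild/makeParent are hypothetical helpers that build the Documents:

    List<Document> block = new ArrayList<Document>();
    block.add(makeChild("skill", "java"));   // hypothetical helper
    block.add(makeChild("skill", "python"));
    block.add(makeParent("name", "Lisa"));   // the parent doc closes the block
    writer.addDocuments(block);              // written as one contiguous block,
                                             // never split across segments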
Re: Adding another dimension to Lucene searches
I've put up code, example data and tests for the Nested Document feature here: http://www.inperspective.com/lucene/LuceneNestedDocumentSupport.zip

The data used in the unit tests is chosen to illustrate practical use of real-world content. The final unit tests will work on more abstract data for more formal/exhaustive testing of functionality.

This packaging changes no existing Lucene code and is bundled with 3.0.1 but should work with 2.9.1. The readme.txt highlights the issues with segment flushing that may need addressing before adoption.

Cheers
Mark
Re: Adding another dimension to Lucene searches
Having implemented this code on a few projects, I find that the major challenge shifts from the back end to the problem of the front end and how to get end users to articulate the questions Lucene can answer with this. Certainly an interesting challenge, but that's another topic...

- Original Message -
From: J. Delgado
To: dev@lucene.apache.org
Sent: Mon, 10 May, 2010 16:47:50
Subject: Re: Adding another dimension to Lucene searches

Hierarchical documents are a key concept towards a unified structured+unstructured search. It should allow us to fully implement things such as XQuery + Full-Text (http://www.w3.org/TR/xquery-full-text/).

Additionally it solves a century-old problem: how to deal with sections/sub-sections in very large documents. A long time ago I was indexing text books (in PDF) and had to break down each book into pages and store the main doc id in a field as a pointer to maintain the relation.

Mark, way to go!

-- Joaquin

On Mon, May 10, 2010 at 8:03 AM, Grant Ingersoll wrote:
> Very cool stuff, Mark.
>
> Can you just open a JIRA and attach there?
Re: Web-Based Luke
See http://search.lucidimagination.com/search/document/63cef9e98692a126/webluke_include_jetty_in_lucene_binary_distribution

There's a link to a zip file with source there which should still be available.

On 9 Jul 2010, at 15:14, Mark Miller wrote:

> Did the GWT version of Luke that Mark Harwood started ever get dumped to
> JIRA or anything? All I can find is a link to a war, but not the source.
> Mark? Anyone?
>
> - Mark
Re: Web-Based Luke
I had concerns about the bloat the GWT compiler would add to the source distro. If of interest, it could do with upgrading to the latest GWT.

Ideally all Luke front ends (Swing/GWT/Thinlet) would share the same back-end API. Decoupling from Thinlet as done in this webluke code is the first step down the road to a version of Luke that is Apache-license-friendly and can therefore be maintained by the Apache community.

On 12 Jul 2010, at 00:02, John Wang wrote:

> Mark:
>
> This is a super useful tool!
>
> Any plans of putting it under lucene contrib?
>
> Thanks
>
> -John
Re: Web-Based Luke
Agreed. I think Apache is a preferable home.

The major change to Luke in providing a Luke core API is the need to be remotable, i.e. use of an interface and serializable data objects used for args. GWT RPC should take care of the marshalling, and I've used similar frameworks for applet clients. Like Andrzej, I have limited time to work on this though. :(

On 12 Jul 2010, at 08:54, Andrzej Bialecki wrote:

> On 2010-07-12 09:14, John Wang wrote:
>> share FE with luke is defn a good idea.
>>
>> any thoughts on putting webluke up on goog code or github?
>
> Guys, if you want to move forward with webluke, I think it's better to do
> this under Lucene contrib. The reason is that if there's a substantial
> development done outside Apache then it will need a code grant, and also
> it's more difficult for other Lucene committers to participate in the
> outside development and to bring its results back to ASF.
>
> I'm perfectly willing to donate all Luke's code to ASF, as I've said many
> times in the past, if there's any chance of someone stepping in and removing
> the Thinlet dependency. I'm also willing to work together as a Lucene
> committer on an abstracted Luke core, if not on the GWT front-end (I don't
> know GWT and I have too little time to learn it now).
>
> --
> Best regards,
> Andrzej Bialecki
Re: Nested Document support in Lucene
> AFAIK this is still under heavy development and it doesn't seem to be
> ready in the near future.

It's stable as far as I'm concerned. LUCENE-2454 includes the code and JUnit tests that work with the latest 3.0.3 release. I have versions of this running in production with 2.4- and 2.9-based releases. The only concern for users is the need to carefully control when flushing occurs, and the accompanying readme.txt gives advice on how to achieve this (a rough sketch of the idea follows below).

From: Kapil Charania
To: simon.willna...@gmail.com
Cc: Simon Willnauer; dev@lucene.apache.org
Sent: Tue, 22 March, 2011 9:12:20
Subject: Re: Nested Document support in Lucene

May I know in which release it will be ready to use?

On Sat, Mar 19, 2011 at 2:23 PM, Simon Willnauer wrote:
> On Sat, Mar 19, 2011 at 9:39 AM, Kapil Charania wrote:
>> Hi,
>>
>> I am a newbie to Lucene. I have already created indexes for my project.
>> But now the requirement is to go with Nested Documents. I googled a lot
>> but cannot find much implementation of nested documents.
>>
>> May I know if it's already implemented in any release of Lucene.
>>
>> Thanks in Advance!!!
>
> AFAIK this is still under heavy development and it doesn't seem to be
> ready in the near future. It has not yet been released.
>
> simon

--
Kapil Charania
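For context, a hedged sketch of the kind of flush discipline that advice is about (3.0-era IndexWriter setters; the authoritative guidance is the patch's readme.txt): make the automatic flush triggers unlikely to fire mid-block, and flush only at block boundaries.

    // 3.0-era API: stop the doc-count trigger splitting a parent/child block
    // and make RAM-based flushes rare.
    writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);
    writer.setRAMBufferSizeMB(256);

    // ... add each parent together with all of its child documents, then:
    writer.commit(); // a safe point: no parent/child block straddles a flush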
Re: revisit naming for grouping/join?
>> I think what would be best is a smallish but feature complete demo,

For the nested stuff I had a reasonable demo on LUCENE-2454 that was based around resumes - that use case has the one-to-many characteristics that lend themselves to nesting, e.g. a person has many different qualifications and records of employment. This scenario was illustrated here: http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

I also had the "book search" type scenario where a book has many sections and, for the purposes of efficient highlighting/summarisation, these sections were treated as child docs which could be read quickly (rather than highlighting a whole book).

I'm not sure what the "parent" was in your doctor-and-cities example, Mike. If a doctor is in only one city then there is no point making city a child doc, as the one city's info can happily be combined with the doctor info into a single document with no conflict (doctors have different properties to cities). If the city is the parent with many child doctor docs that makes more sense, but feels like a less likely use case, e.g. "find me a city with doctor x and a different doctor y". Searching for a person with excellent Java and preferably good Lucene skills feels like a more real-world example.

It feels like documenting some of the trade-offs behind index design choices is useful too, e.g. nesting is not too great for very volatile content with constantly changing children, while search-time join is more costly in RAM and 2-pass processing.

Cheers
Mark

- Original Message -
From: Michael McCandless
To: dev@lucene.apache.org
Sent: Fri, 1 July, 2011 13:51:04
Subject: Re: revisit naming for grouping/join?

I think joining and grouping are two different functions, and we should keep different modules for them...

On Thu, Jun 30, 2011 at 10:30 PM, Robert Muir wrote:
> Hi,
>
> when looking at just a very quick glance at some of the newer
> grouping/join features, I found myself a little confused about what is
> exactly what, and I think users might too.

They are confusing!

> I discussed some of this with hossman, and it only seemed to make me
> even more totally confused about:
> * difference between field collapsing and grouping

I like the name grouping better here: I think field collapsing undersells (it's only one specific way to use grouping). EG, grouping w/o collapsing is useful (eg, Best Buy grouping hits by product category and showing the top 5 in each).

> * difference between nested documents and the index-time join

Similarly I think nested docs undersells index-time join: you can join (either during indexing or during searching) in many different ways, and nested docs is just one use case. EG, maybe your docs are doctors but during indexing you join to a city table with facts about that city (each doctor's office is in a specific city) and then you want to run queries like "city's avg annual temp > 60 and doctor has good bedside manner" or something.

> * difference between index-time-join/nested documents and single-pass
> index-time grouping. Is the former only a more general case of the
> latter?

Grouping is purely a presentation concern -- you are not altering which docs hit; you are simply changing how you pick which hits to display ("top N by group"). So we only have collectors here. The "generic" (requires 2 passes) collectors can group on anything at search time; the "doc block" collector requires that you indexed all docs in each group as a block.

Join is both about restricting matches and also presentation of hits, because your query needs to match fields from different [logical] tables (so, the module has a Query and a Collector). When you get the results back, you may or may not be interested in retaining the table structure in your result set (ie, you may not have selected fields from the child table). Similarly, "generic" joining (in Solr/ElasticSearch today but I'd like to factor it into the join module) can do any join at search time, while the "doc block" collector requires that you did the necessary join(s) during indexing.

> * difference between the above joinish capabilities and solr's join
> impl... other than the single-pass/index-time limitation (which is
> really an implementation detail), I'm talking about use cases.

Solr's/ElasticSearch's join is more general because you can join anything at search time (even across 2 different indexes), vs doc block join where you must pick which joins you will ever want to use and then build the index accordingly.

You can also mix the two. Maybe you do certain joins while indexing, but then at search time you do other joins "generically". That's fine. (Same is true for grouping.)

> I think its especially interesting since the join module depends on
> the grouping module.

The join module does currently depend on the grouping module, but for a silly reason: just for the TopGroups, to represent the returned hits. We could mov
BlockJoin concerns
I've been looking at the BlockJoin stuff in 3.4 in relation to children of multiple types and have a couple of concerns which are either issues, or my ignorance of the API:

Concern #1

If I only retrieve children of type A, all is well.
If I only retrieve children of type B, all is well.
If I try to retrieve children of type A and then B, I get a null TopGroups returned for B.
(test code for this at the end of this email)

Concern #2

I'm not sure where I get to control how many children of type A and of type B are returned per parent? BlockJoinCollector's constructor only controls how many parents are collected.

*Post-search* I can call BlockJoinCollector.getTopGroups(childQueryA, ...maxDocsPerGroup..) to define how many children I get back. Does this imply that if I ask for more child docs than are cached by the collector, the search is somehow automatically repeated? If so, what would be the "default" number of child docs cached by the collector, and where would I set that?

Cheers
Mark

Below is the code I added to the existing TestBlockJoin which exercises the above.

    //=
    public void testMultiChildTypes() throws Exception {

      final Directory dir = newDirectory();
      final RandomIndexWriter w = new RandomIndexWriter(random, dir);

      final List<Document> docs = new ArrayList<Document>();

      docs.add(makeJob("java", 2007));
      docs.add(makeJob("python", 2010));
      docs.add(makeQualification("maths", 1999));
      docs.add(makeResume("Lisa", "United Kingdom"));
      w.addDocuments(docs);

      IndexReader r = w.getReader();
      w.close();
      IndexSearcher s = new IndexSearcher(r);

      // Create a filter that defines "parent" documents in the index -
      // in this case resumes
      Filter parentsFilter = new CachingWrapperFilter(new QueryWrapperFilter(
          new TermQuery(new Term("docType", "resume"))));

      // Define child document criteria (finds an example of relevant work
      // experience)
      BooleanQuery childJobQuery = new BooleanQuery();
      childJobQuery.add(new BooleanClause(
          new TermQuery(new Term("skill", "java")), Occur.MUST));
      childJobQuery.add(new BooleanClause(
          NumericRangeQuery.newIntRange("year", 2006, 2011, true, true), Occur.MUST));

      BooleanQuery childQualificationQuery = new BooleanQuery();
      childQualificationQuery.add(new BooleanClause(
          new TermQuery(new Term("qualification", "maths")), Occur.MUST));
      childQualificationQuery.add(new BooleanClause(
          NumericRangeQuery.newIntRange("year", 1980, 2000, true, true), Occur.MUST));

      // Define parent document criteria (find a resident in the UK)
      Query parentQuery = new TermQuery(new Term("country", "United Kingdom"));

      // Wrap the child document query to 'join' any matches
      // up to corresponding parent:
      BlockJoinQuery childJobJoinQuery = new BlockJoinQuery(
          childJobQuery, parentsFilter, BlockJoinQuery.ScoreMode.Avg);
      BlockJoinQuery childQualificationJoinQuery = new BlockJoinQuery(
          childQualificationQuery, parentsFilter, BlockJoinQuery.ScoreMode.Avg);

      // Combine the parent and nested child queries into a single query
      // for a candidate
      BooleanQuery fullQuery = new BooleanQuery();
      fullQuery.add(new BooleanClause(parentQuery, Occur.MUST));
      fullQuery.add(new BooleanClause(childJobJoinQuery, Occur.MUST));
      fullQuery.add(new BooleanClause(childQualificationJoinQuery, Occur.MUST));

      //? How do I control the volume of jobs vs qualifications per parent?
      BlockJoinCollector c = new BlockJoinCollector(Sort.RELEVANCE, 10, true, false);
      s.search(fullQuery, c);

      // Examine "Job" children
      boolean showNullPointerIssue = true;
      if (showNullPointerIssue) {
        TopGroups<Integer> jobResults = c.getTopGroups(childJobJoinQuery, null, 0, 10, 0, true);
        // assertEquals(1, results.totalHitCount);
        assertEquals(1, jobResults.totalGroupedHitCount);
        assertEquals(1, jobResults.groups.length);

        final GroupDocs<Integer> group = jobResults.groups[0];
        assertEquals(1, group.totalHits);

        Document childJobDoc = s.doc(group.scoreDocs[0].doc);
        // System.out.println("  doc=" + group.scoreDocs[0].doc);
        assertEquals("java", childJobDoc.get("skill"));
        assertNotNull(group.groupValue);
        Document parentDoc = s.doc(group.groupValue);
        assertEquals("Lisa", parentDoc.get("name"));
      }

      // Now examine qualification children
      TopGroups<Integer> qualificationResults =
          c.getTopGroups(childQualificationJoinQuery, null, 0, 10, 0, true);
      //! This next line can null pointer - but only if the prior "jobs"
      //  section is called first
      assertEquals(1, qualificationResults.totalGroupedHitCount);
      assertEquals(1, qualificationResults.groups.length);

      final GroupDocs<Integer> qGroup = qualificationResults.groups[0];
      assertEquals(1, qGroup.totalHits);

      Document childQualificationDoc = s.doc(qGroup.scoreDocs[0].doc);
      assertEquals("maths", childQualificationDoc.get("qualification"));
      assertNotNull(qGroup.groupValue);
      Document parentDoc = s.doc(qGroup.groupValue);
      assertEquals("Lisa", parentDoc.get("name"));
    }
Re: BlockJoin concerns
>> I opened LUCENE-3519 for the unexpected null when pulling the
>> TopGroups, and added your test case (thanks!).

Great, thanks.

>> the collector internally gathers all child docIDs for a given collected
>> parent docID

OK - I guess that scales OK, because the number of docIDs per parent is naturally limited by the number of docs you can hold in RAM as part of the original IW.addDocuments call - i.e. not in the millions.

Cheers,
Mark

- Original Message -
From: Michael McCandless
To: dev@lucene.apache.org; mark harwood
Sent: Friday, 14 October 2011, 13:56
Subject: Re: BlockJoin concerns

Hi Mark,

I opened LUCENE-3519 for the unexpected null when pulling the TopGroups, and added your test case (thanks!).

On Concern #2, this is not limited today: the collector internally gathers all child docIDs for a given collected parent docID, and only at the end, when asked for the top groups, does it sort the child docIDs within each group and keep the topN you passed to it.

Mike McCandless
http://blog.mikemccandless.com
Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves
Hi Mark

I've played with Shingles recently in some auto-categorisation work where my starting assumption was that multi-word terms will hold more information value than individual words, and that phrase queries on separate terms will not give these term combos their true reward (in terms of IDF) - or, if they did compute the true IDF, would require lots of disk IO to do this. Shingles present a conveniently pre-aggregated score for these combos.

Looking at the results of MoreLikeThis queries based on shingling analyzers, the results I saw generally seemed good, but I did not formally benchmark this against non-shingled indexes. Not everything was rosy in that I did see some tendency to over-reward certain rare shingles (e.g. a shared mention of "New Years Eve Party" pulled otherwise mostly unrelated news articles together). This led me to look at using the links in resulting documents to help identify clusters of on-topic and potentially off-topic results to tune these discrepancies out, but that's another topic.

BTW, the Luke tool has a "Zipf" plugin that you may find useful in examining index term distributions in Lucene indexes.

Cheers
Mark

From: Mark Bennett
To: java-...@lucene.apache.org
Sent: Fri, 10 September, 2010 1:42:11
Subject: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

I want to boost the relevancy of some Question and Answer content. I'm using stop words, Dismax, and I'm already a fan of Phrase Boosting and have cranked that up a bit. But I'm considering using long Shingles to make use of some of the normally stopped-out "junk words" in the content to help relevancy further.

Reminder: "Shingles" are artificial tokens created by gluing together adjacent words.
Input text: This is a sentence
Normal tokens: this, is, a, sentence (without removing stop words)
2+3 word shingles: this-is, is-a, a-sentence, this-is-a, is-a-sentence

A few questions on relevance and shingles:

1: How similar are the relevancy calculations between Shingles and exact phrases?

I've seen material saying that shingles can give better performance than normal phrase searching, and I'm assuming this is exact phrase (vs. allowing for phrase slop). But do the relevancy calculations for normal exact phrase and Shingles wind up being *identical*, for the same documents and searches? That would seem an unlikely coincidence, but possibly it could have been engineered to intentionally behave that way.

2: What's the latest on Shingles and Dismax?

The low-level, front-end tokenization in Dismax would seem to be a problem, but does the new parser stuff help with this?

3: I'm thinking of a minimum 3-word shingle; does anybody have comments on shingle length?

Eyeballing the 2-word shingles, they don't seem much better than stop words. Obviously my shingle field bypasses stop words. But the 3-word shingles start to look more useful, expressing more intent, such as "how do i", "do i need" and "it work with", etc. Have there been any Lucene/Solr studies specifically on shingle length?

and finally...

4: Is it useful to examine your token occurrences against a power-law log-log curve?

So, with either single words or shingles, you do a histogram, and then plot the histogram in an X-Y graph, with both axes being logarithmic. Then see if the resulting graph follows (or diverges from) a straight line.

This "Long Tail" / Pareto / power-law mathematics was very popular a few years ago for looking at histograms of DVD rentals and human activities, and prior to the web, the power law and 80/20 rules had been observed in many other situations, both man-made and natural. Also of interest: when a distribution is expected to follow a power line, but the actual data deviates from that theoretical line, this might indicate some other factors at work, or so the theory goes.

So if users' searches follow any type of histogram with a hidden power-law line, then it makes sense to me that the source content might also follow a similar distribution. Is the normal IDF ranking inspired by that type of curve? And *if* word occurrences, in either searches or source documents, were expected to follow a power-law distribution, then possibly shingles would follow such a curve as well.

Thinking that document text, like many other things in nature, might follow such a curve, I used the Lucene index to generate such a curve. And I did the same thing for 3-word tokens. The 2 curves do have different slopes, but neither is very straight.

So I was wondering if anybody else has looked at IDF curves (actually non-inverted document frequency curves) or raw word instance counts and power-law graphs? I haven't found a smoking gun in my online searches, but I'm thinking some of you would know this.

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves
>>What is the "best practices" formula for determining above average >>correlations >>of adjacent terms I gave this some thought in https://issues.apache.org/jira/browse/LUCENE-474 I found the Jaccard cooefficient favoured rare words too strongly and so went for a blend as shown below: public float getScore() { float overallIntersectionPercent = coIncidenceDocCount / (float) (termADocFreq + termBDocFreq); float termBIntersectionPercent = coIncidenceDocCount / (float) (termBDocFreq); //using just the termB intersection favours common words as // coincidents eg "new" food // return termBIntersectionPercent; //using just the overall intersection favours rare words as // coincidents eg "scezchuan" food //return overallIntersectionPercent; // so here we take an average of the two: return (termBIntersectionPercent + overallIntersectionPercent) / 2; } From: Mark Bennett To: dev@lucene.apache.org Sent: Fri, 10 September, 2010 18:44:31 Subject: Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves Thanks Mark H, Maybe I'll look at MLT (More Like This) again. I'll also check out zipf. It's claimed that Question and Answer wording is different enough for generic text content that different techniques might be indicated. From what I remember: 1: Though nouns normally convey 60% of relevancy in general text, Q&A content is skewed a bit more towards verbs. 2: Questions may contain more noise words (though perhaps in useful groupings) 3: Vocabulary mismatch of Interrogative vs. declarative / narrative (Q vs A) 4: Vocabulary mismatch of novices vs experts (Q vs A) It was item 2 that I was hoping to capitalize on with NGrams / Shingles. Still waiting for the relevancy math nerds to chime in about the log-log and IDF stuff ... ;-) I was thinking a bit more about the math involved here What is the "best practices" formula for determining above average correlations of adjacent terms, beyond what random chance would give. So you notice that "white" and "house" appear next to each other more than what chance distribution would explain, so you decide it's an important NGram. The "noise floor" isn't too bad for the typical shopping cart items calculation. You analyze the items present or not present in 1,000 shopping cart receipts. If grocery items were completely independent then "random" level is just the odds of the 2 items multiplied together: 1,000 shopping carts 200 have cereal 250 have milk chance of cereal = 200/1,000 = 20% milk = 250/1,000 = 25% IF independent then P(cereal AND milk) = P(cereal) * P(milk) 20% * 25% = 5% So 50 carts likely to have both cereal and milk And if MORE than 50 carts have cereal and milk, then it's worth noting. The classic example is diapers and beer, which is a bit apocryphal and NOT expected, but I like the breakfast cereal and milk example better because it IS expected. Now back to word-A appearing directly before word-B, and finding the base level number of times you'd expect just from random chance. Although Lucene/Luke gives you total word instances and document counts, what you'd really want is the number of possible N-Grams, which is affected by document boundaries, so it gets a little weird. Some other differences between the word-A word-B calculation vs milk and cereal: 1: I want ordered pairs, "white" before "house" 2: A document is NOT like a shopping cart in that I DO care how many times "white" appears before "house", whereas in the shopping carts I only cared about present or not present, so document count is less helpful here. 
I'm sure some companies and PHD's have super secret formulas for this, but I'd be content to just compare it to baseline random chance. Mark B -- Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513 On Fri, Sep 10, 2010 at 3:17 AM, mark harwood wrote: Hi Mark >I've played with Shingles recently in some auto-categorisation work where my >starting assumption was that multi-word terms will hold more information value >than individual words and that phrase queries on seperate terms will not give >these term combos their true reward (in terms of IDF) - or if they did compute >the true IDF, would require lots of disk IO to do this. Shingles present a >conveniently pre-aggregated score for these combos. >Looking at the results of MoreLikeThis queries based on a shingling analyzers >the results I saw generally seemed good but did not formally b
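Picking up the LUCENE-474 blend earlier in this exchange: a hedged sketch of how its inputs could be gathered from a 3.x index. The calls (IndexReader.docFreq, a conjunction search, TopDocs.totalHits) are real; the field and term names are illustrative only.

    // Document frequencies for each term on its own.
    int termADocFreq = reader.docFreq(new Term("body", "new"));
    int termBDocFreq = reader.docFreq(new Term("body", "food"));

    // Documents containing both terms = the co-incidence count.
    BooleanQuery both = new BooleanQuery();
    both.add(new TermQuery(new Term("body", "new")), Occur.MUST);
    both.add(new TermQuery(new Term("body", "food")), Occur.MUST);
    int coIncidenceDocCount = searcher.search(both, 1).totalHits;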
Document links
I've been looking at graph databases recently (Neo4j, OrientDB, InfiniteGraph) as a faster alternative to relational stores. I notice they either embed Lucene for indexing node properties or (in the case of OrientDB) are talking about doing this.

I think their fundamental performance advantage over relational stores is that they don't have to de-reference foreign keys in a b-tree index to get from a source node to a target node. Instead they use internally-generated IDs to act like pointers, with more-or-less direct references between nodes/vertexes. As a result they can follow links very quickly. This got me thinking: could Lucene adopt the idea of creating links between documents that were equally fast, using Lucene doc IDs?

Maybe the user API would look something like this...

    indexWriter.addLink(fromDocId, toDocId);
    DocIdSet reader.getInboundLinks(docId);
    DocIdSet reader.getOutboundLinks(docId);

Internally a new index file structure would be needed to record link info. Any recorded links that connect documents from different segments would need careful adjustment of referenced link IDs when segments merge and Lucene doc IDs are shuffled.

As well as handling typical graphs (social networks, web data), this could potentially be used to support tagging operations where apps could create "tag" documents and then link them to existing documents that are being tagged, without having to update the target doc. There are probably a ton of applications for this stuff.

I see the graph DBs busy recreating transactional support, indexes, segment merging etc. and it seems to me that Lucene has a pretty good head start with this stuff. Anyone else think this might be an area worth exploring?

Cheers
Mark
Re: Document links
>> Wouldn't that be sufficient?

Not for some apps. I tried playing the "Kevin Bacon" game using a Lucene index of IMDB data with actorID and movieID keys. The difference between that and Neo4j on the same data and query is night and day. The graph databases are really onto something when resolving a relationship doesn't first require an index look-up of the endpoints.

- Original Message -
From: Paul Elschot
To: dev@lucene.apache.org
Sent: Tue, 21 September, 2010 17:25:31
Subject: Re: Document links

When the (primary) key values are provided by the user, one could use additional small documents to only store/index these relations whenever they change. Wouldn't that be sufficient?

Regards,
Paul Elschot
Re: Document links
> It should be possible to randomly add and delete such relationships after
> indexWriter.addDocument(), is that the idea?

Yes. A "like" action may, for example, allow me to tag an existing document by connecting 2 documents - my personal "like" document and a document with content of interest.

doc 1 = [user:mark tag:like]
doc 56 = [title:Lucene body:Lucene is a search library...]

I then call: indexWriter.addLink(1, 56)

If this was my first "like" then I may need to contemplate using a variation of the above API that allows a yet-to-be-committed "Document" object in place of the doc ids.

> Adding such relationships by docId would need the addition of
> a separate (from the segments) index structure

Yes, I need to think about the detail of file structures next. For now I'm sticking with thinking about the user API and functionality, and assuming we can maintain cross-segment docid references that get updated somehow at merge time.

> Would each link also have an attribute (think payload)?

I was thinking that if attributes are needed (e.g. a star rating on my document "like" example) then this could be catered for with a document - e.g. rather than linking the single doc [user:mark tag:like] to all my liked docs I could create specific doc instances of [user:mark rating:5 tag:like] and link via those.

> Would such relationships be named (sth like foreign key field names)?

For now I was thinking of storing simple docid->docid links. Once we have these links we could do some funky things (pseudo code; see the sketch below):

    //My fave docs from last week
    int myLikesDocId = searchForLuceneDocWithUserNameAndTag("mark", "like");
    DocIdSet myLikedDocs = indexReader.getOutboundLinks(myLikesDocId);
    searcher.search(lastWeekRangeQuery, new Filter(myLikedDocs));

    //Other users who share my interests
    DocIdSet usersWhoLikeWhatILike = indexReader.getInboundLinks(myLikedDocs);

Cheers
Mark
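[Editor's sketch] A minimal Java rendering of that flow, assuming the hypothetical linking API from this thread (addLink/getOutboundLinks do not exist in Lucene; the Filter wrapper below also ignores Lucene's per-segment readers for brevity, so treat it purely as an illustration against 3.x-era signatures):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;

    // Adapt a DocIdSet of liked docs into a Filter for searcher.search(query, filter, n).
    // NOTE: a real implementation would have to resolve doc ids per segment reader.
    class DocIdSetFilter extends Filter {
        private final DocIdSet docs;
        DocIdSetFilter(DocIdSet docs) { this.docs = docs; }
        @Override
        public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
            return docs;
        }
    }

    // Usage (hypothetical linking API):
    // indexWriter.addLink(1, 56);                              // mark's like-doc -> liked doc
    // DocIdSet myLikedDocs = indexReader.getOutboundLinks(1);  // follow links, no key lookup
    // searcher.search(lastWeekRangeQuery, new DocIdSetFilter(myLikedDocs), 10);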
Re: Document links
Some initial thoughts on the challenges in maintaining docid->docid links: https://spreadsheets.google.com/ccc?key=0AsKVSn5SGg_wdHhMUW9ya0xxUFI3VXBHZGZHVUo4RkE&hl=en&authkey=CLOmwrgL#gid=0
Re: Document links
This slideshow has a first cut at the Lucene file format extensions required to support fast linking between documents: http://www.slideshare.net/MarkHarwood/linking-lucene-documents

Interested in any of your thoughts.

Cheers,
Mark
Re: Document links
>>While not exactly equivalent, it reminds me of our earlier discussion around
>>"layered segments" for dealing with field updates

Right. Fast discovery of document relations is a foundation on which lots of things like this can build. Relations can be given types to support a number of different use cases.

- Original Message - From: Grant Ingersoll To: dev@lucene.apache.org Sent: Fri, 24 September, 2010 16:26:27 Subject: Re: Document links

While not exactly equivalent, it reminds me of our earlier discussion around "layered segments" for dealing with field updates [1], [2], albeit this is a bit more generic since one could not only use the links for relating documents, but one could use "special" links underneath the covers in Lucene to maintain/mark which fields have been updated and then traverse to them.

[1] http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
[2] http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e

--
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
Re: Document links
My starting point in the solution I propose was to eliminate linking via any type of key. Key lookups mean indexes and indexes mean disk seeks. Graph traversals involve exponential numbers of links, so all these index disk seeks start to stack up. The solution I propose uses doc ids as more-or-less direct pointers into file structures, avoiding any index lookup. I've started coding up some tests using the file structures I outlined and will compare that with a traditional key-based approach.

For reference - playing the "Kevin Bacon game" on a traditional Lucene index of IMDB data took 18 seconds to find a short path that Neo4j finds in 200 milliseconds on the same data (and this was a disk-based graph of 3m nodes, 10m edges). Going from actor->movies->actors->movies produces a lot of key lookups, and the difference between key indexes and direct node pointers becomes clear (see the sketch below). I know path-finding analysis is perhaps not a typical Lucene application, but other forms of link analysis, e.g. recommendation engines, require similar performance.

Cheers
Mark

On 25 Sep 2010, at 11:41, Paul Elschot wrote:

> How about using this (BSD-licensed) tree as a starting point:
> http://bplusdotnet.sourceforge.net/
> It has various keys, a.o. byte array, String and long.
>
> A fixed size byte array as key seems to be just fine: two bytes for a field number, four for the segment number and four for the in-segment document id. The separate segment number would allow to minimize the updates in the tree during merges. One could also use the normal doc id directly.
>
> The value could then be similar to the key, but without the field number, and with an indication of the direction of the link. Or perhaps the direction of the link should be added to the key. A link would be present twice, once for each direction. Also both directions could have their own payloads.
>
> It could be put in its own file as a separate 'segment', or maybe each segment could allow for allocation of a part of this tree.
>
> I like this somehow; in case it is done right one might never need a relational database again. Well, almost...
>
> Regards,
> Paul Elschot
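[Editor's sketch] To make the cost concrete, here is roughly what each hop of the key-based variant looks like in Lucene 3.x terms (my own illustration, not the benchmark source from the spreadsheet; the field names and fan-out figures are assumptions) - one terms-index seek per key, multiplied by the fan-out at every hop:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Resolve one foreign key to its doc id via the terms index - a seek per call.
    static int docForKey(IndexReader reader, String field, String key) throws IOException {
        TermDocs td = reader.termDocs(new Term(field, key));
        try {
            return td.next() ? td.doc() : -1;
        } finally {
            td.close();
        }
    }

    // A breadth-first Kevin Bacon search fans out fast: at, say, ~50 movies per actor
    // and ~20 actors per movie, two hops already mean thousands of docForKey() calls,
    // where a graph DB would follow stored pointers instead.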
Re: Document links
>>Both these on disk data structures and the ones in a B+ tree have seek offsets into files
>>that require disk seeks. And both could use document ids as key values.

Yep. However my approach doesn't use a doc id as a key that is searched in any B+ tree index (which involves disk seeks) - it is used as a direct offset into a file to get the pointer into a "links" data structure.

>>But do these disk data structures support dynamic addition and deletion of (larger
>>numbers of) document links?

Yes, the slide deck I linked to shows how links (like documents) spend the early stages of life being merged frequently in the smaller, newer segments and over time migrate into larger, more stable segments as part of Lucene transactions. That's the theory - I'm currently benchmarking an early prototype.

- Original Message - From: Paul Elschot To: dev@lucene.apache.org Sent: Sat, 25 September, 2010 22:03:28 Subject: Re: Document links

On Saturday 25 September 2010 15:23:39, Mark Harwood wrote:
> My starting point in the solution I propose was to eliminate linking via any type of key. [...]

Both these on disk data structures and the ones in a B+ tree have seek offsets into files that require disk seeks. And both could use document ids as key values.

But do these disk data structures support dynamic addition and deletion of (larger numbers of) document links?

B+ trees are a standard solution for problems like this one, and it would probably not be easy to outperform them. It may be possible to improve performance of B+ trees somewhat by specializing for the fairly simple keys that would be needed, and by encoding very short lists of links for a single document directly into a seek offset to avoid the actual seek, but that's about it.

Regards,
Paul Elschot
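[Editor's sketch] The "doc id as direct offset" idea might look something like this with fixed-width records (the ".lnkp"/".lnk" file layout and names are invented for illustration and are not the prototype's actual format; the IndexInput calls follow the Lucene 3.x store API):

    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IndexInput;

    // Hypothetical fixed-width "links pointer" file: one long per docId, pointing into
    // a variable-length links data file. The docId is used as a direct offset - no
    // terms-index lookup, no key comparison, just one positioned read.
    static long linksPointerFor(Directory dir, String segmentName, int docId) throws IOException {
        IndexInput in = dir.openInput(segmentName + ".lnkp");
        try {
            in.seek((long) docId * 8);  // 8 bytes = one long per document
            return in.readLong();       // offset of this doc's link list in the .lnk file
        } finally {
            in.close();
        }
    }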
Re: Polymorphic Index
Perhaps another way of thinking about the problem: given a large range of IDs (e.g. your 300 million) you could constrain the number of unique terms using a double-hashing technique, e.g. pick a number "n" for the max number of unique terms you'll tolerate, say 1 million, and store 2 terms for every primary key using different hashing functions:

    int hashedKey1 = hashFunction1(myKey) % maxNumUniqueTerms;
    int hashedKey2 = hashFunction2(myKey) % maxNumUniqueTerms;

Queries to retrieve/delete a record then use a search for hashedKey1 AND hashedKey2. The probability of having the same collision on two different hashing functions is minimal and should return the original record only (see the sketch below). Obviously you would still have the postings recorded, but these would be slightly more compact - e.g. each of your 1 million unique terms would have ~300 gap-encoded VInt entries as opposed to 300m postings of one full int.

Cheers
Mark

On 21 Oct 2010, at 20:44, eks dev wrote:

> Hi All,
> I am trying to figure out a way to implement the following use case with lucene/solr.
>
> In order to support simple incremental updates (master) I need to index and store a UID field on a 300Mio collection. (My UID is a 32 byte sequence.) But I do not need it indexed (only stored) during normal searching (slaves).
>
> The problem is that my term dictionary gets blown away by the sheer number of unique IDs. The number of unique terms on this collection, excluding UID, is less than 7Mio. I can tolerate the resource hit on the Updater (big hardware, on-disk index...).
>
> This is a master-slave setup, where searchers run from RAMDisk, and having 300Mio * 32 bytes (give or take prefix compression) plus pointers to postings and postings is something I would really love to avoid, as this is significant compared to the really small documents I have.
>
> Cutting to the chase, how can I have an indexed UID field and, when done with indexing:
> 1) Load a "searchable" index into RAM from such an index on disk, without one field?
> 2) Create 2 indices in sync on docIDs, one containing only the indexed UID?
> 3) Somehow transform an index with an indexed UID by dropping the UID field, preserving docIDs - a kind of smart index-editing tool?
>
> Something else already there I do not know?
>
> Preserving docIds is crucial, as I need support for lovely incremental updates (like in solr master-slave update). Also the stored field should remain! I am not looking for "use MMapped index and let the OS deal with it" advice... I do not mind doing it with the flex branch 4.0, not being in a hurry.
>
> Thanks in advance,
> Eks
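[Editor's sketch] A minimal rendering of the double-hashing idea against the Lucene 3.x API (the two hash functions, field names and the cap are purely illustrative assumptions):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    static final int MAX_UNIQUE_TERMS = 1000000; // "n": cap on unique terms per hash field

    // Two independent hashes of the same UID (String.hashCode plus a salted variant,
    // purely for illustration; real code would use stronger, independent hash functions)
    static int hash1(String key) { return Math.abs(key.hashCode() % MAX_UNIQUE_TERMS); }
    static int hash2(String key) { return Math.abs((key + "#2").hashCode() % MAX_UNIQUE_TERMS); }

    // Index time: store two hashed terms instead of indexing the raw 32-byte UID
    static void addHashedUid(Document doc, String uid) {
        doc.add(new Field("uidH1", Integer.toString(hash1(uid)), Field.Store.NO, Field.Index.NOT_ANALYZED));
        doc.add(new Field("uidH2", Integer.toString(hash2(uid)), Field.Store.NO, Field.Index.NOT_ANALYZED));
    }

    // Retrieve/delete time: both hashes must match
    static BooleanQuery uidQuery(String uid) {
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("uidH1", Integer.toString(hash1(uid)))), Occur.MUST);
        q.add(new TermQuery(new Term("uidH2", Integer.toString(hash2(uid)))), Occur.MUST);
        return q;
    }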
Re: Polymorphic Index
Good point, Toke. Forgot about that. Of course, doubling the number of hash algos used to 4 increases the space massively.

On 21 Oct 2010, at 22:51, Toke Eskildsen wrote:

> Mark Harwood [markharw...@yahoo.co.uk]:
>> Given a large range of IDs (eg your 300 million) you could constrain
>> the number of unique terms using a double-hashing technique [...]
>
> I am sorry, but this won't work. It is a variation of the birthday paradox:
> http://en.wikipedia.org/wiki/Birthday_problem
>
> Assuming the two hash functions are ideal, so that there will be 1M different values from each after the modulo, the probability of any given pair of different UIDs having the same hashes is 1/(1M * 1M). That's very low. Another way to look at it would be to say that there are 1M * 1M possible values for the aggregated hash function.
>
> Using the recipe from http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem we have n = 300M, d = 1M * 1M and the formula 1-((d-1)/d)^(n*(n-1)/2).
>
> We see that the probability of a collision is ... 1. Or rather, so close to 1 that Google's calculator will not show any decimals. Turning the number of UIDs down to just 3M, we still get the probability 0.99881 for a collision. It does not really help to increase the number of unique hashes, as there are simply too many possible pairs of UIDs.
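[Editor's sketch] Toke's figure is easy to reproduce; a quick log-space evaluation of the formula above (my own check, not from the thread):

    public class BirthdayCheck {
        public static void main(String[] args) {
            double n = 300e6;      // number of UIDs
            double d = 1e6 * 1e6;  // aggregated hash space: 1M x 1M values
            // P(collision) = 1 - ((d-1)/d)^(n(n-1)/2); evaluate the power in log space
            // to avoid underflow in the intermediate terms
            double logNoCollision = (n * (n - 1) / 2) * Math.log1p(-1.0 / d);
            System.out.println(1 - Math.exp(logNoCollision)); // prints 1.0 (probability ~ 1)
        }
    }

The exponent n(n-1)/2 is about 4.5e16 pairs against only 1e12 hash values, so the no-collision probability underflows to zero, matching Toke's conclusion.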
Re: Using filters to speed up queries
Look at BooleanQuery with 2 "must" clauses - one for the query, one for a ConstantScoreQuery wrapping the filter. BooleanQuery should then automatically use skips when reading matching docs from the main query, skipping to the next docs identified by the filter (a minimal example below). Give it a try; otherwise you may be looking at using separate indexes.

On 23 Oct 2010, at 23:18, Khash Sajadi wrote:

> My index contains documents for different users. Each document has the user id as a field on it.
>
> There are about 500 different users with 3 million documents.
>
> Currently I'm calling Search with the query (parsed from the user) and a FieldCacheTermsFilter for the user id.
>
> It works but the performance is not great.
>
> Ideally, I would like to perform the search only on the documents that are relevant; this should make it much faster. However, it seems Search(Query, Filter) runs the query first and then applies the filter.
>
> Is there a way to improve this? (i.e. run the query only on a subset of documents)
>
> Thanks
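[Editor's sketch] The suggestion, spelled out against the Lucene 3.x API (the "userId" field name is an assumption):

    import java.io.IOException;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.FieldCacheTermsFilter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    static TopDocs searchForUser(IndexSearcher searcher, Query userQuery, String userId)
            throws IOException {
        BooleanQuery bq = new BooleanQuery();
        bq.add(userQuery, Occur.MUST);
        // As a MUST clause the filter participates in the conjunction, so its skip
        // support is used to leap-frog with the main query rather than post-filtering.
        bq.add(new ConstantScoreQuery(new FieldCacheTermsFilter("userId", userId)), Occur.MUST);
        return searcher.search(bq, 10);
    }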
Re: How can I get started for investigating the source code of Lucene ?
Here's a rough overview I mapped out as a sequence diagram for the search side of things some time ago: http://goo.gl/lE6a

- Original Message - From: Jeff Zhang To: dev@lucene.apache.org Sent: Mon, 1 November, 2010 5:43:08 Subject: How can I get started for investigating the source code of Lucene ?

Hi all,

I'd like to study the source code of Lucene, but I found there are not many documents about the internal structure of Lucene, and the classes are so big that they are not very readable. Could anyone give me a suggestion about how to get started investigating the source code of Lucene? Any document or blog post would be good.

Thanks

--
Best Regards
Jeff Zhang
Re: Document links
I came to the conclusion that the transient meaning of document ids is too deeply ingrained in Lucene's design to use them to underpin any reliable linking. While it might work for relatively static indexes, any index with a reasonable number of updates or deletes will invalidate any stored document references in ways which are very hard to track. Lucene's compaction shuffles IDs without taking care to preserve identity, unlike graph DBs like Neo4j (see "recycling IDs" here: http://goo.gl/5UbJi ).

Cheers,
Mark

- Original Message - From: Ryan McKinley To: dev@lucene.apache.org Sent: Mon, 8 November, 2010 19:03:59 Subject: Re: Document links

Any updates/progress with this? I'm looking at ways to implement an RTree with lucene -- and this discussion seems relevant.

thanks
ryan
Re: Document links
> What about if we define an id field (like in solr)?

Last time I floated the idea of supporting primary keys as a core concept in Lucene (in the context of helping doc updates, not linking) there were objections along the lines of "lucene shouldn't try to be a database".

On 8 Nov 2010, at 20:47, Ryan McKinley wrote:

> On Mon, Nov 8, 2010 at 2:52 PM, mark harwood wrote:
>> I came to the conclusion that the transient meaning of document ids is too deeply ingrained in Lucene's design to use them to underpin any reliable linking.
>
> What about if we define an id field (like in solr)? Whatever does the traversal would need to make a Map, but that is still better than needing to do a query for each link.
>
> oh ya -- and it is even more awkward since each subreader often reuses the same docId
>
> ryan
Re: Document links
I was using within-segment doc ids stored in link files named after both the source and target segments (a link, after all, is 2 endpoints). For a complete solution you ultimately have to deal with the fact that doc ids could be references to:

* Stable, committed docs (the easy case)
* Flushed but not yet committed docs
* Buffered but not yet flushed docs
* Flushed/committed but currently merging docs

...all of which are happening in different threads, e.g. the reader has one view of the world while a background thread is busy merging segments to create a new view of the world, even after commits have completed. All very messy.
Re: New Lucene features and Solr indexes
>>Instead of making other APIs to accomodate BloomFilter's current
>>brokenness: remove its custom per-field logic so it works with
>>PerFieldPostingsFormat, like every other PF.

Not looked at it in a while, but I'm pretty certain that, like every other PF, you can go ahead and use PerFieldPF with the Bloom filter just fine. What was broken was (is?) that in this configuration PFPF isn't smart enough to avoid creating twice as many files as are required - see LUCENE-4093. Until that is resolved (and I have noted my pessimism about that being fixed easily) BloomPF contains an optimisation for those that want to avoid this inefficiency. The use of that optimisation is entirely optional for users.

Internally to BloomPF, the implementation of that optimisation is trivial - if a null bloom set is returned for a given field, it ignores the usual bloom filtering logic and delegates directly to the wrapped codec. You can choose to implement a BloomFilterFactory that adds this field-choice optimisation or, more simply, run the default PerFieldPF-managed configuration and live with the increased number of files (see the sketch below). Arguably, the inefficiencies of the PerFieldPF framework are the real issue to be addressed here.

>>I brought this up before it was committed, and i was ignored

You stopped engaging in the debate when I outlined the 3 proposed options for moving BloomPF forward: http://goo.gl/mxtP9 Those options were:

1) ignore the inefficiencies in PFPF
2) sort out the issues in PFPF (LUCENE-4093, but probably a more complex solution)
3) work around existing PFPF issues with a simple but entirely optional optimisation to BloomPF

I opted for 3) and gave notice that I'd take it out if anyone objected. I don't think there's been any movement on 2) so I guess you're still happy with option 1)? I recall you didn't think the business of extra files was that much of a concern: http://goo.gl/eJWo3

(Incidentally, probably best following up on the relevant JIRAs rather than here.)

Cheers
Mark

From: Robert Muir To: dev@lucene.apache.org Sent: Wednesday, 13 February 2013, 13:01 Subject: Re: New Lucene features and Solr indexes

On Wed, Feb 13, 2013 at 4:42 AM, Adrien Grand wrote:
> Hi Shawn,
>
> On Tue, Feb 12, 2013 at 8:58 PM, Shawn Heisey wrote:
>> Some of these, like compressed stored fields and compressed termvectors, are being turned on by default, which is awesome. I'm already running a 4.2 snapshot, so I've got those in place.
>
> Excellent!
>
>> One thing that I know I would like to do is use the new BloomFilter for a couple of my fields that contain only unique values. Last time I checked (which was before the 4.1 release), if you added the lucene-codecs jar, Solr had a BloomFilter postings format, but didn't have any way to specify the underlying format. See SOLR-3950 and LUCENE-4394.
>
> BloomFilterPostingsFormat is a little special compared to other postings formats because it can wrap any postings format. So maybe it should require special support, like an additional attribute in the field type definition?

-1

Instead of making other APIs to accommodate BloomFilter's current brokenness: remove its custom per-field logic so it works with PerFieldPostingsFormat, like every other PF.

In other words, it should work just like pulsing.

I brought this up before it was committed, and I was ignored. That's fine, but I'll be damned if I let its incorrect design complicate other parts of the codebase too. I'd rather it continue to stay difficult to integrate and continue walking its current path to an open source death instead.
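[Editor's sketch] For context, the "default PerFieldPF-managed configuration" referred to above looks roughly like this against the Lucene 4.x codecs module (the codec version, delegate format name and "id" field are assumptions):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat;
    import org.apache.lucene.codecs.lucene42.Lucene42Codec;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    // Bloom-wrap the default postings format for the "id" field only; PerFieldPF
    // does the per-field routing (at the cost of the extra files discussed in LUCENE-4093).
    static IndexWriterConfig bloomedConfig(Analyzer analyzer) {
        Codec codec = new Lucene42Codec() {
            private final PostingsFormat bloomPk =
                new BloomFilteringPostingsFormat(PostingsFormat.forName("Lucene41"));
            @Override
            public PostingsFormat getPostingsFormatForField(String field) {
                return "id".equals(field) ? bloomPk : super.getPostingsFormatForField(field);
            }
        };
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42, analyzer);
        iwc.setCodec(codec);
        return iwc;
    }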
Re: New Lucene features and Solr indexes
>>should be a stupid simple postings format like any other postings format with
>>a default configuration

It does have a default config. It just needs a PF delegate in the constructor, just like Pulsing. Like Rob said:

>>In other words, it should work just like pulsing.

So far so good. Now, where people are getting upset (for no particularly good reason in my view) around per-field stuff: if you really, really want to, you can supply a subclass of BloomFilterFactory to your BloomPF constructor, which allows customised control over the choice of hashing algo, bitset sizing and saturation policies if the DefaultBloomFilterFactory fails to make the right choices. 99.9% of people will not do this. The reason it is a factory object and not some dumb settings is that it is called on a per-segment basis with state info that is useful context in making sizing choices.

Now, (horror of horrors), the factory's API is passed a FieldInfo object in the method designed to produce a bitset. It is conceivable that some rogue agents could choose to implement some per-field decisions here if the same BloomPF instance was registered to handle >1 field (see the sketch below). In addition, BloomPF has some common-sense defensive coding that checks if the factory returns null for the bitset - in which case it delegates all calls un-bloomed directly to the delegate codec.

None of this prevents the use of BloomPF in the prescribed PerFieldPF manner for handling field-specific choices. I happen to use a custom BloomFilterFactory to implement a more efficient indexing pipeline than the prescribed PerFieldPF route of implementing all per-field policies "up high" in the stack - but none of that is at the cost of a clean BloomPF API or with any unnecessary duplication of PerFieldPF logic.

If anything needs changing here, there may be a case for providing a convenience class that weds BloomPF and a default choice of Lucene40 codec so it can help with whatever Solr and other config-driven engines may need, i.e. zero-arg constructors, if that's how their registry of codecs works.

Cheers
Mark

From: Uwe Schindler To: dev@lucene.apache.org Sent: Wednesday, 13 February 2013, 16:47 Subject: RE: New Lucene features and Solr indexes

Hi Shawn,

I was arguing also at the time when this was committed. I fully agree with Robert; the current API is not in good shape! I have the same feeling: Bloom postings should be a stupid simple postings format like any other postings format with a default configuration. If you really want to change its configuration, you can subclass it as a separate postings format.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> From: Shawn Heisey [mailto:s...@elyograg.org]
> Sent: Wednesday, February 13, 2013 3:59 PM
> To: dev@lucene.apache.org
> Subject: Re: New Lucene features and Solr indexes
>
> Robert,
>
> I have to send you a general thank you for your dedication to the quality of this project, and for your amazing ability to seemingly keep the entire design for Lucene in your head at all times.
>
> I'm not sure what exactly you want to die here, or what you think would be the best option for me, the Solr end-user. Is BloomFilter something that's not worth pursuing, or would you just like it to be integrated in a different way?
>
> Thanks,
> Shawn
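[Editor's sketch] For completeness, the optional per-field opt-out Mark describes would look something like this (class and method names follow the Lucene 4.x codecs module, but treat the exact signatures as assumptions):

    import org.apache.lucene.codecs.bloom.DefaultBloomFilterFactory;
    import org.apache.lucene.codecs.bloom.FuzzySet;
    import org.apache.lucene.index.FieldInfo;
    import org.apache.lucene.index.SegmentWriteState;

    // Bloom only the primary-key field; returning null makes BloomPF delegate
    // un-bloomed to the wrapped postings format for every other field.
    class PkOnlyBloomFilterFactory extends DefaultBloomFilterFactory {
        @Override
        public FuzzySet getSetForField(SegmentWriteState state, FieldInfo info) {
            return "id".equals(info.name) ? super.getSetForField(state, info) : null;
        }
    }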
[jira] [Created] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
Mark Harwood created LUCENE-4069:

Summary: Segment-level Bloom filters for a 2 x speed up on rare term searches
Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6
Reporter: Mark Harwood
Priority: Minor
Fix For: 3.6.1

An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments, but it also speeds up general searching in my tests.

Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU

Patch based on the 3.6 codebase attached. There are no API changes currently - to play, just add a field with "_blm" on the end of the name to invoke the special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.
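[Editor's sketch] In later 4.0-style TermsEnum terms, the fast-fail the issue describes amounts to roughly the following (a paraphrase of the idea, not the patch's actual code; murmurHash, bloomMask, bloomBits and delegateTermsEnum are illustrative placeholders):

    // Consult the in-memory per-segment bloom bitset before touching the real
    // terms dictionary.
    public boolean seekExact(BytesRef term) throws IOException {
        int hash = murmurHash(term) & bloomMask;   // cheap, purely in-memory test
        if (!bloomBits.get(hash)) {
            return false;  // "definitely not in this segment": no terms-dict or disk access
        }
        return delegateTermsEnum.seekExact(term);  // "maybe": consult the real dictionary
    }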
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:

Attachment: MHBloomFilterOn3.6Branch.patch

Initial patch
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:

Attachment: PrimaryKey40PerformanceTestSrc.zip
            BloomFilterCodec40.patch
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:

Attachment: BloomFilterCodec40.patch
            PrimaryKey40PerformanceTestSrc.zip

I've ported this Bloom filtering code to work as a 4.0 Codec now. I see a 35% improvement over standard Codecs on random lookups on a warmed index. I also notice that the PulsingCodec is no longer faster than the standard Codec - is this news to people, as I thought it was supposed to be the way forward?

My test rig (adapted from Mike's original primary key test rig here: http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html) is attached as a zip. The new BloomFilteringCodec is also attached here as a patch. Searches against plain text fields also look to be faster (using AOL500k queries searching Wikipedia English) but obviously that particular test rig is harder to include as an attachment here.

I can open a separate JIRA issue for this 4.0 version of the code if that makes more sense.
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:

Attachment: (was: BloomFilterCodec40.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:

Attachment: (was: PrimaryKey40PerformanceTestSrc.zip)
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283583#comment-13283583 ]

Mark Harwood commented on LUCENE-4069:

My current focus is speeding up primary key lookups but this may have applications outside of that (Zipf tells us there is a lot of low-frequency stuff in free text). Following the principle that the best IO is no IO, the Bloom filter helps us quickly understand which segments to even bother looking in. That has to be a win overall.

I started out trying to write this Codec as a wrapper for any other Codec (it simply listens to a stream of terms and stores a bitset of recorded hashes in a ".blm" file). However, that was trickier than I expected - I'd need to encode a special entry in my .blm files just to know the name of the delegated codec I needed to instantiate at read time, because Lucene's normal Codec-instantiation logic would be looking for "BloomCodec" and I'd have to discover the delegate that was used to write all of the non-blm files (see the sketch below).

Not looked at FixedBitSet but I imagine that could be used instead.
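[Editor's sketch] The delegate-discovery problem described here is usually solved by recording the delegate's name in the wrapper's own file - a hedged sketch against the 4.x SPI (blmOut/blmIn stand for the .blm file's IndexOutput/IndexInput; treat the exact calls as assumptions):

    // Write time: remember which postings format wrote the non-.blm files
    blmOut.writeString(delegatePostingsFormat.getName());

    // Read time: resolve the delegate by name via the PostingsFormat SPI
    PostingsFormat delegate = PostingsFormat.forName(blmIn.readString());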
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283615#comment-13283615 ]

Mark Harwood commented on LUCENE-4069:

Update - I've discovered this Bloom Filter Codec currently has a bug where it doesn't handle indexes with >1 field. It's probably all tangled up in the "PerField..." codec logic so I need to do some more digging.
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:

Attachment: (was: BloomFilterCodec40.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:

Attachment: BloomFilterCodec40.patch

Fixed the issue with >1 field in an index. Tests on random lookups on Wikipedia titles (unique keys) now show a 3 x speed up for a Bloom-filtered index over the standard 4.0 Codec for fully warmed indexes.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284536#comment-13284536 ]

Mark Harwood commented on LUCENE-4069:

bq. I wonder if that also helps indexing in terms of applying deletes. did you test that

I've not looked into that particularly but I imagine this may be relevant. Thanks for the tips for making the patch more generic. I'll get on it tomorrow and change to FixedBitSet while I'm at it.

bq. I also wonder if we can extract a "bloomfilter" class into utils

There's some reusable stuff in this patch for downsizing the bitset (according to desired saturation levels) having accumulated a stream of values (see the sketch below).
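[Editor's sketch] The "downsizing" mentioned here is typically done by repeatedly folding the bitset in half while saturation stays low - a simplified illustration of that idea, not the patch's code (the threshold policy is an assumption):

    import org.apache.lucene.util.FixedBitSet;

    // Halve the bloom bitset while it remains sparse: OR the top half into the
    // bottom half. Membership tests then hash modulo the new, smaller size.
    static FixedBitSet downsize(FixedBitSet bits, float maxSaturation) {
        int size = bits.length();
        while (size > 64 && bits.cardinality() < size * maxSaturation / 2) {
            int half = size / 2;
            FixedBitSet smaller = new FixedBitSet(half);
            for (int i = 0; i < half; i++) {
                if (bits.get(i) || bits.get(i + half)) {
                    smaller.set(i);
                }
            }
            bits = smaller;
            size = half;
        }
        return bits;
    }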
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment: (was: BloomFilterCodec40.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment: BloomFilterCodec40.patch

Updated to work with trunk:
* Changed to use FixedBitSet
* Is now a PostingsFormat with an abstract base class
* Added the missing MurmurHash class

TODOs:
* Move the Bloom filter logic into common utils classes
* Use Service Providers for a pluggable choice of hash algorithms?
* Expose settings for memory/saturation
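To illustrate how a pluggable hash would slot in, here is a hypothetical abstraction of the kind the TODOs describe (the patch's real classes may differ, and the MurmurHash implementation itself is omitted):

{code:title=HashFunctionSketch.java}
/** Illustrative only - a pluggable hash of the kind the TODOs describe. */
interface HashFunction {
  /** Name recorded at write time so the same algorithm is chosen at read time. */
  String getName();

  int hash(byte[] bytes, int offset, int length);
}

/** Mapping a hashed term onto a bit in a power-of-two sized bit set. */
class BloomRecorder {
  private final HashFunction hashFunction;
  private final long[] bits; // stand-in for Lucene's FixedBitSet storage
  private final int mask;    // bitSetSize - 1, where bitSetSize is a power of two

  BloomRecorder(HashFunction hashFunction, int bitSetSize) {
    this.hashFunction = hashFunction;
    this.bits = new long[bitSetSize / 64];
    this.mask = bitSetSize - 1;
  }

  void record(byte[] termBytes, int offset, int length) {
    int pos = hashFunction.hash(termBytes, offset, length) & mask;
    bits[pos >> 6] |= 1L << (pos & 63);
  }
}
{code}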
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Description:
        An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail term lookups, helping avoid wasted disk access.
        Best suited for low-frequency fields, e.g. primary keys on big indexes with many segments, but it also speeds up general searching in my tests.
        Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
        Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
        Patch based on the 3.6 codebase attached.
        There are no 3.6 API changes currently - to play, just add a field whose name ends in "_blm" to invoke the special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to the APIs to configure the service properly.
        Also, a patch for the Lucene 4.0 codebase introducing a new PostingsFormat.

    was: (the same description, minus the "3.6" qualifier on "API changes" and without the final line about the Lucene 4.0 patch)

    Affects Version/s: 4.0
        Fix Version/s: 4.0
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285100#comment-13285100 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------
bq. I think you should really not provide any field handling at all.

By that I think you mean keep the abstract BloomFilteringPostingsFormatBase and dispense with the BloomFilteringLucene40Codec (and BloomFilteredLucene40PostingsFormat?). I was trying to limit the extensions apps would have to write to use this service (one custom PostingsFormat subclass, one custom Codec subclass and one custom SPI config file), but I can see that equally we shouldn't offer implementations for all the many different service permutations. I'll look at adding something to RandomCodec for Bloom-plus-random-delegate PostingsFormat.

bq. I am still worried about the TermsEnum reuse code, are you planning to look into this?
bq. you keep on switching back and forth creating new delegated TermsEnum instances

I'm not sure what you mean by creating new delegated TermsEnum instances. In my implementation of "iterator(TermsEnum reuse)" I take care to unwrap my TermsEnum wrapper to find the original delegate's TermsEnum, and then call the delegate Terms' iterator method with that object as the reuse parameter. At that point shouldn't the delegate Terms just recycle the unwrapped TermsEnum as per the normal reuse contract, as though no wrapping had been done?

bq. you should also add license headers to the files you are adding

Will do.
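The reuse pattern being described might look like the following fragment; the wrapper class and accessor names here are hypothetical stand-ins, not the patch's actual code:

{code:title=TermsEnumReuseSketch.java}
// Illustrative fragment only - hypothetical names, not the patch's code.
// Inside the Bloom-filtering Terms wrapper:
@Override
public TermsEnum iterator(TermsEnum reuse) throws IOException {
  TermsEnum delegateReuse = reuse;
  if (reuse instanceof BloomFilteredTermsEnum) {
    // Unwrap: hand the delegate back the enum it originally created so it
    // can recycle it under the normal reuse contract.
    delegateReuse = ((BloomFilteredTermsEnum) reuse).getDelegateTermsEnum();
  }
  return new BloomFilteredTermsEnum(delegateTerms.iterator(delegateReuse), bloomFilter);
}
{code}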
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285518#comment-13285518 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------
Thanks for the comment, Rob. While the choice of Codec can be an anonymous inner class, resolving the choice of PostingsFormat is trickier. BloomFilterPostingsFormat is now intended to wrap any other choice of PostingsFormat, and Simon has suggested leaving the Bloom support purely abstract. However, as an end user, if I want to use Bloom support on the standard Lucene codec I would then have to write one of these:

{code:title=MyBloomFilteredLucene40Postings.java}
public class MyBloomFilteredLucene40Postings extends BloomFilteringPostingsFormatBase {
  public MyBloomFilteredLucene40Postings() {
    // declare my choice of PostingsFormat to be wrapped and provide a
    // unique name for this combo of Bloom-plus-delegate
    super("myBL40", new Lucene40PostingsFormat());
  }
}
{code}

The resulting index files are then named [segname]_myBL40.[filetypesuffix]. At read time the "myBL40" bit of the filename is used to look up the decoding class via Service Provider registrations, so "com.xx.MyBloomFilteredLucene40Postings" would need adding to an o.a.l.codecs.PostingsFormat file for the registration to work.

I imagine Bloom-plus-Lucene40Postings would be a common combo, and if both are in core it would be annoying to have to code support for this in each app, or for things like Luke to need classpaths redefined to access an app-specific class that was created purely to bind this combo of core components.

I think a better option might be to change the Bloom filtering base class to record the choice of delegate PostingsFormat in its own "blm" file at write time, and instantiate the appropriate delegate instance at read time using the recorded name. The Bloom filtering base class would need changing from an abstract to a final class so that core Lucene could load it as the handler for [segname]_BloomPosting.xxx files; it would then examine the [segname].blm file to discover and instantiate the chosen delegate PostingsFormat using the standard service registration mechanism. At write time clients would instantiate the BloomFilterPostingsFormat, passing a choice of PostingsFormat delegate to the constructor. At read time Lucene core would invoke a zero-arg constructor. I'll look into this as an approach.
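The discovery scheme sketched in that last paragraph could look roughly like this; it is an assumption about the approach being proposed, not committed code, though PostingsFormat.forName and the DataOutput/DataInput string methods are standard Lucene 4.0 API:

{code:title=DelegateDiscoverySketch.java}
// Illustrative only. At write time the chosen delegate's name is persisted
// alongside the Bloom filter data; at read time a zero-arg-constructed
// Bloom filtering postings format looks the delegate up again via SPI.
import java.io.IOException;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.DataOutput;

class DelegateDiscoverySketch {
  static void writeDelegateName(DataOutput out, PostingsFormat delegate) throws IOException {
    out.writeString(delegate.getName()); // e.g. "Lucene40"
  }

  static PostingsFormat readDelegate(DataInput in) throws IOException {
    String name = in.readString();
    return PostingsFormat.forName(name); // standard SPI lookup by recorded name
  }
}
{code}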
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285706#comment-13285706 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------
Aaaargh. Unless I've missed something, I have concerns with the fundamental design of the current Codec loading mechanism. It seems too tied to the concept of a ServiceProvider class-loading mechanism, forcing users to write new SPI-registered classes simply to declare what amount to index schema configuration choices.

Example: if I take Rob's sample Codec above and choose to use a subtly different configuration of the same PostingsFormat class for different fields, it breaks:

{code:title=ThisBreaks.java}
Codec fooCodec = new Lucene40Codec() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    if ("text".equals(field)) {
      return new FooPostingsFormat(1);
    }
    if ("title".equals(field)) {
      // same impl as the "text" field, different constructor settings
      return new FooPostingsFormat(2);
    }
    return super.getPostingsFormatForField(field);
  }
};
{code}

This causes a file-overwrite error, as PerFieldPostingsFormat uses the same name from FooPostingsFormat(1) and FooPostingsFormat(2) to create files. In order to safely make use of differently configured choices of the same PostingsFormat we are forced to declare a brand new subclass with a unique new service name and an entry in the service provider registration. This is essentially where I have got to in trying to integrate this Bloom filtering logic.

This dependency on writing custom classes seems to make everything a bit fragile, no? What hope has Luke got of opening the average index without careful assembly of classpaths etc.? If I contrast this with the world of database schemas, it seems absurd to rely on writing behaviour-free custom classes simply to preserve an application's schema settings. Even an IOC container with XML declarations would offer a more agile means of assembling pre-configured *beans*, rather than relying on a Service Provider mechanism that only serves as a registry of *classes*. Anyone else see this as a major pain?
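For readers unfamiliar with the mechanism being criticised here: the registration in question is a standard JDK services resource on the classpath. For the hypothetical FooPostingsFormat above (package name illustrative), it amounts to a one-line file:

{code:title=META-INF/services/org.apache.lucene.codecs.PostingsFormat}
com.example.FooPostingsFormat
{code}

Lucene's read path maps the format name recorded in the index back to the class listed in this file, which is why every new configuration-bearing subclass needs its own entry.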
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285744#comment-13285744 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------
This fails if you add docs with title and text fields:

{code:title=ThisCrashes.java}
Codec fooCodec = new Lucene40Codec() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    if ("text".equals(field)) {
      return new MemoryPostingsFormat(false);
    }
    if ("title".equals(field)) {
      return new MemoryPostingsFormat(true);
    } else {
      return super.getPostingsFormatForField(field);
    }
  }
};
{code}

Exception in thread "main" java.io.IOException: Cannot overwrite: C:\temp\luceneCodecs\_2_Memory.ram

This also fails:

{code:title=ThisToo.java}
Codec fooCodec = new Lucene40Codec() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    if ("text".equals(field)) {
      return new SimpleTextPostingsFormat();
    }
    if ("title".equals(field)) {
      return new SimpleTextPostingsFormat();
    } else {
      return super.getPostingsFormatForField(field);
    }
  }
};
{code}

with

Exception in thread "main" java.io.IOException: Cannot overwrite: C:\temp\luceneCodecs\_1_SimpleText.pst

Whereas sharing the same instance of a PostingsFormat class across fields works:

{code:title=ThisWorks.java}
Codec fooCodec = new Lucene40Codec() {
  SimpleTextPostingsFormat theSimple = new SimpleTextPostingsFormat();

  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    if (("text".equals(field)) || ("title".equals(field))) {
      return theSimple;
    } else {
      return super.getPostingsFormatForField(field);
    }
  }
};
{code}
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285765#comment-13285765 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------
bq. its just an issue with PerFieldPostingsFormat

OK, thanks. My guess is you'll effectively have to supplement postingsformat.getName() with an object instance ID in file names.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285773#comment-13285773 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------
bq. When I run all tests with the bloom 4.0 postings format (ant test-core -Dtests.postingsformat=BloomFilteredLucene40PostingsFormat),

Thanks for the pointer on targeting codec testing. I have another patch to upload with various tweaks (e.g. a configurable choice of hash functions and RandomCodec additions), so I will concentrate testing on that before uploading.
[jira] [Commented] (LUCENE-4090) PerFieldPostingsFormat cannot use name as suffix
[ https://issues.apache.org/jira/browse/LUCENE-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286411#comment-13286411 ]

Mark Harwood commented on LUCENE-4090:
--------------------------------------
Thanks for the quick fix, Rob :) Working fine for me here now.

> PerFieldPostingsFormat cannot use name as suffix
> -------------------------------------------------
>
>                 Key: LUCENE-4090
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4090
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 4.0, 5.0
>
>         Attachments: LUCENE-4090.patch, LUCENE-4090.patch
>
> Currently PFPF just records the name in the metadata, which matches up to the segment suffix. But this isn't enough; e.g. someone can use Pulsing(1) on one field and Pulsing(2) on another field.
> See Mark Harwood's examples struggling with this on LUCENE-4069.
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment: BloomFilterPostings40.patch

This is looking more promising. Running "ant test-core -Dtests.postingsformat=TestBloomFilteredLucene40Postings" now passes all tests except for OOM exceptions on three:
* TestConsistentFieldNumbers.testManyFields
* TestIndexableField.testArbitraryFields
* TestIndexWriter.testManyFields

Any pointers on how to annotate or otherwise avoid the Bloom filter class for "many-field" tests would be welcome. These are not realistic tests for this class (we don't expect indexes with hundreds of primary-key-like fields).

In this patch I've:
* added an SPI lookup mechanism for pluggable hash algorithms
* documented the file format
* fixed issues with TermVector tests
* changed the API

To use: BloomFilteringPostingsFormat now takes a delegate PostingsFormat and a set of field names that are to have Bloom filters created. Fields that are not listed in the filter set can safely be indexed as normal, and doing so is beneficial because it allows filtered and non-filtered field data to co-exist in the same physical files created by the delegate PostingsFormat.
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment: (was: BloomFilterCodec40.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment: (was: BloomFilterPostings40.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------
    Attachment: BloomFilterPostings40.patch

Added missing class
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286598#comment-13286598 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------
bq. Instead i think the concrete Bloom+Lucene40 that you have in tests should be moved into src/java and registered there

What problem would that be trying to solve? Registration (or creation) of BloomFilteringPostingsFormat subclasses is not necessary to decode index contents. Offering a "Bloom40" would only buy users a pairing of Lucene40Postings and Bloom filtering, and they would still have to declare which fields they want Bloom filtering on at write time. This isn't too hard using the code in the existing patch:

{code:title=ThisWorks.java}
final Set<String> bloomFilteredFields = new HashSet<String>();
bloomFilteredFields.add(PRIMARY_KEY_FIELD_NAME);

iwc.setCodec(new Lucene40Codec() {
  BloomFilteringPostingsFormat postingOptions =
      new BloomFilteringPostingsFormat(new Lucene40PostingsFormat(), bloomFilteredFields);

  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    return postingOptions;
  }
});
{code}

No extra subclasses or registrations are required here to read an index built with the above setup.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286600#comment-13286600 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------
bq. An alternative would be to just pick this less often in RandomCodec: see the SimpleText hack

Another option might be to make TestBloomFilteredLucene40Postings pick a ludicrously small bitset sizing option for each field, so that we can accommodate tests that create silly numbers of fields. The bitsets, being so small, will quickly reach saturation and force all reads to hit the underlying FieldsProducer.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286616#comment-13286616 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------
bq. I dont understand why this handles fields. Someone should just pick that with perfieldpostingsformat.

That would be inefficient, because your PFPF will see BloomFilteringPostingsFormat(field1 + Lucene40) and BloomFilteringPostingsFormat(field2 + Lucene40) as fundamentally different PostingsFormat instances, and consequently create multiple, differently named files, because it assumes these instances may be capable of using radically different file structures.

In reality, the choice of a Bloom filter on field 1, a Bloom filter on field 2, or indeed no Bloom filter at all does not fundamentally alter the underlying delegate PostingsFormat's file format - it only adds a supplementary "blm" file on the side with the field summaries. For this reason it is a mistake to configure separate BloomFilterPostingsFormat instances on a per-field basis when they could share a common delegate.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286707#comment-13286707 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------
bq. To solve what you speak of we just need to resolve LUCENE-4093.

Presumably the main objective here is that, in order to cut down on the number of files we store, content consumers of various types should aim to consolidate multiple fields' contents into a single file (if they share common config choices).

bq. Then multiple postings format instances that are 'the same' will be deduplicated correctly.

The complication in this case is that we essentially have two consumers (Bloom and Lucene40), one wrapped in the other, with different but overlapping choices of fields: e.g. we want a single Lucene40 to process all fields, but we want Bloom to handle only a subset of them. This will be a tough one for PFPF to untangle while we are stuck with a delegating model for composing consumers.

This might be made easier if, instead of delegating a single stream, we had a *stream-splitting* capability via a multicast subscription: e.g. the Bloom filtering consumer registers interest in content streams for fields A and B while Lucene40 is consolidating content from fields A, B, C and D. A broadcast mechanism feeds each consumer a copy of the relevant stream, and each consumer is responsible for inventing its own file-naming convention that avoids muddling files.

While that may help for writing streams, it doesn't solve the re-assembly of "producer" streams at read time, where the Bloom filter absolutely has to position itself in front of the standard Lucene40 producer in order to offer fast-fail lookups. In the absence of a fancy optimised routing mechanism (this all may be overkill), my current solution was to put the Bloom filter in the delegate chain, armed with a subset of field names to observe as a larger array of fields flows past to a common delegate. I added some Javadocs to describe the need to do it this way for an efficient configuration. You are right that this is messy (i.e. open to bad configuration), but operating this deep down in Lucene that's always a possibility regardless of what we put in place.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286754#comment-13286754 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------
It's true to say that Bloom is a different case to Pulsing - Bloom does not interfere in any way with the normal recording of content in the wrapped delegate, whereas Pulsing does. It may prove useful for us to mark a formal distinction between these mutating/non-mutating types so we can treat them differently and provide optimisations?

bq. And separately, you can always contain the number of files even today by using only unique instances yourself when writing

Contained, but not optimal - roughly double the number of required files if I want the common case of a primary key indexed with Bloom. I can't see a way of indexing with Bloom-plus-Lucene40 on field "A" and just Lucene40 on fields B, C and D while winding up with only one set of Lucene40 files under a common segment suffix. The way I did find of achieving this was to add a "bloomFilteredFields" set to my single Bloom+Lucene40 instance used for all fields. Is there any other option here currently?

Looking to the future, LUCENE-4093 may have more scope for optimising if it understands the distinction between mutating and non-mutating wrappers and how they are composed.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286815#comment-13286815 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------
bq. Its not worth the complexity

There's no real added complexity in BloomFilterPostingsFormat - it has to be capable of storing blooms for more than one field anyway, and using the field-name set amounts to roughly two extra lines of code to see if a TermsConsumer needs wrapping or not.

From the client side you don't have to use this feature - the field-name set can be null, in which case it will wrap all fields sent its way. If you do choose to supply a set, the wrapped PostingsFormat gains the advantage of being shared for bloomed and non-bloomed fields. We could add a constructor that removes the set and mark the others "expert".

For me this falls into the category of the many faster-if-you-know-about-it optimisations, like FieldSelectors or recycling certain objects. Basically a useful hint to Lucene to save some extra effort, but one which you don't *need* to use. LUCENE-4093 may in future resolve the multi-file issue, but I'm not sure it will do so without significant complication.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286916#comment-13286916 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------
bq. why is this a speed improvement?

Sorry - misleading. Replace the word "faster" in my comment with "better" and that makes more sense: I mean better in terms of resource usage and reduced open file handles. This seemed relevant given the earlier comments about Solr's use of non-compound files:

bq. [Solr] create massive amounts of files if we did so (add to the fact it disables compound files by default and its a disaster...)

I can see there is a useful simplification being sought here: if PerFieldPF can consider each of the unique top-level PFs presented to it as looking after an exclusive set of files, then, as the centralised allocator of file names, it can simply call each unique PF with a choice of segment suffix to name its various files without conflicting with other PFs. LUCENE-4093 is all about better determining which PF is unique, using .equals().

Unfortunately I don't think that approach is sophisticated enough. In order to avoid allocating unnecessary file names, PerFieldPF would have to further understand the nuances of which PFs were being wrapped by other PFs, and which wrapped PFs would be reusable outside of their wrapping PF (as is the case with BloomPF's wrapped PF). That seems a more complex task than implementing equals().

So it seems we have three options:
1) Ignore the problem of creating too many files in the case of BloomPF and any other examples of "wrapping" PFs.
2) Create a PerFieldPF implementation that reuses wrapped PFs, using some generic means of discovering recyclable wrapped PFs (i.e. go further than what LUCENE-4093 currently proposes in adding .equals support).
3) Retain my BloomPF-specific solution to the problem for those prepared to use lower-level APIs.

Am I missing any other options, and which one do you want to go for?
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287258#comment-13287258 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

I've thought some more about option 2 (PerFieldPF reusing wrapped PFs) and it gets very ugly very quickly. There's only so much PerFieldPF can do to rationalize a random jumble of PF instances presented to it by clients.

I think the right place to draw the line is LUCENE-4093, i.e. a simple .equals() comparison on top-level PFs to eliminate any duplicates (sketched below). Any approach that also tries to de-dup nested PFs looks to be adding a lot of complexity, especially when you consider what that does to the model of read-time object instantiation. That would be significant added complexity to solve a problem you have already suggested is insignificant (i.e. too many files doesn't really matter when using CFS).

I can remove the per-field stuff from BloomPF if you want, but I imagine I will routinely subclass it to add this optimisation back into my apps.
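[Editorial sketch of the top-level de-duplication idea referenced above. The map-based suffix allocation is an assumption about how PerFieldPF might use .equals(), not the actual LUCENE-4093 patch; it presumes PostingsFormat implementations gain equals()/hashCode() as that issue proposes.]

{code}
// Hypothetical: allocate one segment suffix per *distinct* top-level
// PostingsFormat (distinct by equals()), so equal formats share one file set.
Map<PostingsFormat, Integer> suffixes = new HashMap<PostingsFormat, Integer>();
for (FieldInfo field : fieldInfos) {
  PostingsFormat pf = getPostingsFormatForField(field.name); // per-field choice
  Integer suffix = suffixes.get(pf);
  if (suffix == null) {
    suffix = suffixes.size();
    suffixes.put(pf, suffix); // first sighting: new suffix, new set of files
  }
  // Fields mapped to equal formats now write under the same suffix. Nested
  // (wrapped) formats are invisible to this scheme - hence the concern above.
}
{code}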
[jira] [Commented] (LUCENE-3772) Highlighter needs the whole text in memory to work
[ https://issues.apache.org/jira/browse/LUCENE-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476044#comment-13476044 ]

Mark Harwood commented on LUCENE-3772:
--------------------------------------

For bigger-than-memory docs, is it not possible to use nested documents to represent subsections (e.g. a child doc for each of the chapters in a book) and then use BlockJoinQuery to select the best child docs? Highlighting can then be run on a more manageable subset of the original content, and Lucene's ranking algos are used to select the best "fragment" rather than the highlighter's own attempts to reproduce this logic. A sketch of the idea follows below.

Obviously it depends on the shape of your content/queries, but books-and-chapters is probably a good fit for this approach.

> Highlighter needs the whole text in memory to work
> --------------------------------------------------
>
>                 Key: LUCENE-3772
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3772
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 3.5
>         Environment: Windows 7 Enterprise x64, JRE 1.6.0_25
>            Reporter: Luis Filipe Nassif
>              Labels: highlighter, improvement, memory
>
> Highlighter methods getBestFragment(s) and getBestTextFragments only accept a String object representing the whole text to highlight. When dealing with very large docs simultaneously, it can lead to heap consumption problems. It would be better if the API could accept a Reader object additionally, like Lucene Document Fields do.
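[Editorial sketch of the chapter-per-child-doc approach suggested above, as a fragment assuming the Lucene 4.x join module (org.apache.lucene.search.join); the field names and example query are made up for illustration.]

{code}
// Index each book as one block: child docs (chapters) first, parent doc last.
List<Document> block = new ArrayList<Document>();
for (String chapterText : chapters) {
  Document chapter = new Document();
  chapter.add(new TextField("chapterText", chapterText, Field.Store.YES));
  block.add(chapter);
}
Document book = new Document();
book.add(new StringField("type", "book", Field.Store.NO)); // marks parents
book.add(new StringField("title", bookTitle, Field.Store.YES));
block.add(book);
writer.addDocuments(block); // atomic block write keeps children adjacent

// Query time: score chapters, join up to books, then highlight only the
// winning chapters' (much smaller) stored text.
Filter parentsFilter = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term("type", "book"))));
Query chapterQuery = new TermQuery(new Term("chapterText", "lucene"));
Query bookQuery =
    new ToParentBlockJoinQuery(chapterQuery, parentsFilter, ScoreMode.Max);
{code}

ToParentBlockJoinCollector can then recover the best-scoring child (chapter) per book, so only that chapter's text needs feeding to the highlighter.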
[jira] [Commented] (SOLR-3950) Attempting postings="BloomFilter" results in UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476854#comment-13476854 ]

Mark Harwood commented on SOLR-3950:
------------------------------------

BloomFilterPostingsFormat is designed to wrap another choice of PostingsFormat, and adds ".blm" files to the other files created by that delegate. However, your code has instantiated a BloomFilterPostingsFormat without passing a choice of delegate - presumably using the zero-arg constructor (see the write-time sketch below). The comments in the code for this zero-arg constructor state:

// Used only by core Lucene at read-time via Service Provider instantiation -
// do not use at Write-time in application code.

> Attempting postings="BloomFilter" results in UnsupportedOperationException
> --------------------------------------------------------------------------
>
>                 Key: SOLR-3950
>                 URL: https://issues.apache.org/jira/browse/SOLR-3950
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.1
>         Environment: Linux bigindy5 2.6.32-279.9.1.el6.centos.plus.x86_64 #1 SMP Wed Sep 26 03:52:55 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
>                      [root@bigindy5 ~]# java -version
>                      java version "1.7.0_07"
>                      Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
>                      Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
>            Reporter: Shawn Heisey
>             Fix For: 4.1
>
> Tested on branch_4x, checked out after BlockPostingsFormat was made the default by LUCENE-4446.
> I used 'ant generate-maven-artifacts' to create the lucene-codecs jar, and copied it into my sharedLib directory. When I subsequently tried postings="BloomFilter" I got the following exception in the log:
> {code}
> Oct 15, 2012 11:14:02 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.UnsupportedOperationException: Error - org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat has been constructed without a choice of PostingsFormat
> {code}
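[Editorial sketch of correct write-time construction from plain Lucene code, assuming the 4.x codecs module; Lucene40PostingsFormat is used purely as an example delegate, and the "id" field routing is made up.]

{code}
// Wrap a concrete delegate at write time; the zero-arg constructor is only
// for SPI loading at read time.
final PostingsFormat bloomPF =
    new BloomFilteringPostingsFormat(new Lucene40PostingsFormat());

// Route chosen fields to the bloom-filtered format via a custom codec.
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
iwc.setCodec(new Lucene40Codec() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    return "id".equals(field) ? bloomPF : super.getPostingsFormatForField(field);
  }
});
{code}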
[jira] [Commented] (SOLR-3950) Attempting postings="BloomFilter" results in UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477036#comment-13477036 ]

Mark Harwood commented on SOLR-3950:
------------------------------------

bq. If there is some schema config that will tell Solr to do the right thing, please let me know.

Right now BloomPF is like an abstract class - you need to fill in the blanks as to what delegate it will use before you can use it at write time. I think we have three options:

1) Solr (or you) provides a new PF impl that weds BloomPF with a choice of PF, e.g. Lucene40PF, so you would have a zero-arg-constructor class named something like BloomLucene40PF, or...
2) Solr extends its config file format to provide a generic means of assembling "wrapper" PFs like Bloom, e.g. postingsFormat="BloomFilter" delegatePostingsFormat="FooPF", and Solr then does reflection magic to call the constructors appropriately, or...
3) Core Lucene is changed so that BloomPF is wedded to a default PF (e.g. Lucene40PF) if users such as Solr fail to nominate a choice of delegate for BloomPF.

Of these, 1) feels like "the right thing". A sketch of what such a class might look like is below.

Cheers,
Mark
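[Editorial sketch of option 1. This is purely hypothetical: the class name, registered format name, and delegate choice do not exist in Lucene and are invented for illustration.]

{code}
import java.io.IOException;
import org.apache.lucene.codecs.FieldsConsumer;
import org.apache.lucene.codecs.FieldsProducer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Zero-arg PostingsFormat wedding BloomPF to a fixed delegate, so Solr/SPI
// can instantiate it by name at both write and read time.
public class BloomLucene40PostingsFormat extends PostingsFormat {
  private final PostingsFormat delegate =
      new BloomFilteringPostingsFormat(new Lucene40PostingsFormat());

  public BloomLucene40PostingsFormat() {
    super("BloomLucene40"); // name recorded per segment and used by SPI lookup
  }

  @Override
  public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
    return delegate.fieldsConsumer(state);
  }

  @Override
  public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
    return delegate.fieldsProducer(state);
  }
}
{code}

The class would also need listing in META-INF/services/org.apache.lucene.codecs.PostingsFormat so the SPI loader can find it.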
[jira] [Created] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
Mark Harwood created LUCENE-4275:
------------------------------------

             Summary: Threaded tests with MockDirectoryWrapper delete active PostingFormat files
                 Key: LUCENE-4275
                 URL: https://issues.apache.org/jira/browse/LUCENE-4275
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/codecs, general/test
    Affects Versions: 4.0-ALPHA
         Environment: Win XP 64bit Sun JDK 1.6
            Reporter: Mark Harwood
             Fix For: 4.0

As part of testing LUCENE-4069 I have encountered sporadic issues with files going missing. I believe this is a bug in the test framework (multi-threading issues in MockDirectoryWrapper?), so I have raised a separate issue with a simplified test PostingFormat class here.

Using this test PF, roughly one in four runs of the following test will fail due to a missing file:

ant test-core -Dtestcase=TestIndexWriterCommit -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE -Dtests.slow=true -Dtests.postingsformat=TestNonCoreDummyPostingsFormat -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=UTF-8
[jira] [Updated] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
[ https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4275:
---------------------------------

    Attachment: Lucene-4275-TestClass.patch

Attached a simple PostingsFormat used to illustrate cases of files going missing in PF tests.
[jira] [Commented] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
[ https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425895#comment-13425895 ]

Mark Harwood commented on LUCENE-4275:
--------------------------------------

Thanks, Rob. This test requires a call to "ant clean" between runs before it will pass consistently. However, I don't consider that a fix, and assume we are still looking for a bug, as there's an index consistency issue lurking somewhere. I've tried adding the setting -Dtests.directory=RAMDirectory, but the test still looks to have some "memory" between runs.

I added some logging of creates and deletes as you suggest, and it looks like on a second, un-cleaned run my PF is asked to open a high-numbered segment which I suspect was created by an earlier run - the logging shows no sign of the PF being asked to create content for this (or any other) segment as part of the current run. At this point it fails, as there is no longer a copy of the "foobar" file listed by the directory. I have also noticed in the logs from previous runs that MDW is asked by IndexWriter to delete the segment's "foobar" file as part of compaction into a compound CFS.

Hope this sheds some light, as I'm finding this a complex one to debug.
[jira] [Commented] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
[ https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426481#comment-13426481 ]

Mark Harwood commented on LUCENE-4275:
--------------------------------------

Nailed it, Mike. Yet another beer I owe you. I removed the IllegalStateException, and it looks like the retry logic now kicks in and all tests pass.

This reliance on throwing a particular exception type feels like an important contract to document. Currently the comments in PostingsFormat.fieldsProducer() read as follows:

bq. Reads a segment. NOTE: by the time this call returns, it must hold open any files it will need to use; else, those files may be deleted.

I propose adding:

bq. Additionally, required files may be deleted during the execution of this call before there is a chance to open them. Under these circumstances an IOException should be thrown by the implementation. IOExceptions are expected and will automatically cause a retry of the segment opening logic with the newly revised segments.

I'll roll that documentation addition into my LUCENE-4069 patch.
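[Editorial sketch of the contract being documented above. Field and method names approximate the 4.x API from memory; loadBloomFilters is a hypothetical helper.]

{code}
@Override
public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
  String blmName = IndexFileNames.segmentFileName(
      state.segmentInfo.name, state.segmentSuffix, "blm");
  // If the file was deleted between segment listing and this open (e.g. a
  // concurrent merge folded it into a CFS), openInput throws an IOException.
  // Let it propagate: the reader's retry logic reopens against the newest
  // commit. Wrapping it in IllegalStateException defeats that retry.
  IndexInput in = state.dir.openInput(blmName, state.context);
  return loadBloomFilters(in); // hypothetical helper that keeps 'in' open
}
{code}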
[jira] [Closed] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files
[ https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood closed LUCENE-4275.
--------------------------------

    Resolution: Not A Problem
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment: (was: BloomFilterPostingsBranch4x.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment: BloomFilterPostingsBranch4x.patch

Updated with a fix for the issue explored in LUCENE-4275.
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment: (was: BloomFilterPostingsBranch4x.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment: BloomFilterPostingsBranch4x.patch

Updated patch to bring it in line with the latest core API changes. All tests now pass clean, so I will commit soon.
[jira] [Resolved] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood resolved LUCENE-4069.
----------------------------------

    Resolution: Fixed
      Assignee: Mark Harwood

Committed to 4.0 branch, revision 1368442.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427322#comment-13427322 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

Will do.
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Fix Version/s: 5.0

Applied to trunk in revision 1368567.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433045#comment-13433045 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

bq. Removing misleading 2X perf gain: it seems to depend heavily on the exact use case.

Fair enough - the original patch targeted Lucene 3.6, which benefited more heavily from this technique. The issue then morphed into a 4.x patch, where performance gains were harder to find. I think the sweet spot is primary-key searches on indexes with ongoing heavy changes (more segment fragmentation, less OS-level caching?). This is the use case I am targeting currently, and my final tests using our primary-key-counting test rig saw a 10 to 15% improvement over Pulsing.

bq. I'm asking because I need this feature but I'm stuck with 3.x for a while.

I have a client in a similar situation who is contemplating using the 3.6 patch.

bq. Are there bugs which should be fixed in the initial 3.6 patch?

It has been a while since I looked at it - a quick run of "ant test" on my copy here showed no errors. I will give it a closer review if my client decides to go down this route, and can post any fixes here. If you use the patch and get into trouble, you should be able to use an un-patched version of 3.6 to read the same index files (it should just ignore the extra "blm" files created by the patched version).
[jira] [Commented] (LUCENE-4369) StringFields name is unintuitive and not helpful
[ https://issues.apache.org/jira/browse/LUCENE-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452900#comment-13452900 ]

Mark Harwood commented on LUCENE-4369:
--------------------------------------

SingleTermField? Not sure "matching vs searching" is a commonly understood differentiation.

> StringFields name is unintuitive and not helpful
> -------------------------------------------------
>
>                 Key: LUCENE-4369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4369
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-4369.patch
>
> There's a huge difference between TextField and StringField: StringField screws up scoring and bypasses your Analyzer. (See the java-user thread "Custom Analyzer Not Called When Indexing" as an example.)
> The name we use here is vital, otherwise people will get bad results.
> I think we should rename StringField to MatchOnlyField.
[jira] [Commented] (LUCENE-4369) StringFields name is unintuitive and not helpful
[ https://issues.apache.org/jira/browse/LUCENE-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452914#comment-13452914 ]

Mark Harwood commented on LUCENE-4369:
--------------------------------------

Agreed on the need for a change - names are important. I have a problem with using "match" on its own because the word is often associated with partial matching, e.g. "best match" or "fuzzy match". A quick Google search suggests "match" has more connotations of fuzziness than exactness - there are 162m results for "best match" vs only 45m results for "exact match".

So how about "ExactMatchField"?
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment: (was: BloomFilterPostings40.patch)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment: BloomFilterPostingsBranch4x.patch

Updated as follows:

* Extracted the Bloom filter functionality into a new oal.util.FuzzySet class. The name is changed because Bloom filtering is only one application of a FuzzySet - fuzzy count-distincts are another.
* BloomFilterPostingsFormat now takes a factory that can tailor the choice of Bloom filter per field (bitset size/saturation settings and choice of hash algorithm). Provided a default factory implementation; a sketch of a custom one is below.
* All unit tests pass now that I have a test PostingsFormat class that uses very small bitsets - before, the many-field unit tests would cause OOM.

Will follow up with benchmarks when I have more time to run and document them. Initial results from my large-scale tests show a nice flat line in the face of a growing index, whereas a non-Bloomed index saw-tooths upwards as segments grow/merge.
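[Editorial sketch of the per-field factory hook described above. Class and method names follow the API as later committed to 4.x (BloomFilterFactory/DefaultBloomFilterFactory/FuzzySet); the "id" field and sizing numbers are invented.]

{code}
// Give a heavily-used primary-key field a larger, less saturated bitset;
// every other field falls back to the defaults.
class PrimaryKeyBloomFactory extends DefaultBloomFilterFactory {
  @Override
  public FuzzySet getSetForField(SegmentWriteState state, FieldInfo info) {
    if ("id".equals(info.name)) {
      // ~10m expected unique values, targeting at most 10% bit saturation
      return FuzzySet.createSetBasedOnQuality(10 * 1000 * 1000, 0.10f);
    }
    return super.getSetForField(state, info);
  }
}

// Usage: wire the factory into the bloom-filtered postings format.
PostingsFormat pf = new BloomFilteringPostingsFormat(
    new Lucene40PostingsFormat(), new PrimaryKeyBloomFactory());
{code}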
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment: (was: PrimaryKey40PerformanceTestSrc.zip)
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
---------------------------------

    Attachment: PrimaryKeyPerfTest40.java

Benchmark tool adapted from Mike's original Pulsing codec benchmark. Now includes a Bloom postings example.
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395773#comment-13395773 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

Interesting results, Mike - thanks for taking the time to run them.

bq. BloomFilteredFieldsProducer should just pass through intersect to the delegate?

I have tried to make the BloomFilteredFieldsProducer get out of the way of the client app and the delegate PostingsFormat as soon as it is safe to do so, i.e. when the user is safely focused on a non-filtered field. While there is a chance the client may end up calling TermsEnum.seekExact(..) on a filtered field, I need a wrapper object in place which can intercept that call (see the sketch below). In all other method invocations I just end up delegating, so I wonder if all these extra method calls are the cause of the slowdown you see, e.g. when Fuzzy is enumerating over many terms.

The only other alternatives to endlessly wrapping in this way are:

a) An API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out for just this one method.
b) Messing around with byte-code manipulation techniques to weave in Bloom filtering (the sort of thing I recall Hibernate resorts to).

Neither of these seems particularly appealing, so I think we may have to live with fuzzy+bloom not being as fast as straight fuzzy.

For completeness' sake: I don't have access to your benchmarking code, but I would hope PostingsFormat.fieldsProducer() isn't called more than once for the same segment, as that's where the Bloom filters get loaded from disk, so there's inherent cost there too. I can't imagine this is the case, though.

BTW I've just finished a long-running set of tests which mixes up reads and writes, here: http://goo.gl/KJmGv

This benchmark represents how graph databases such as Neo4j use Lucene as an index when loading (I typically use the Wikipedia links as a test set). I see a 3.5x speed-up on Lucene 4, and the Bloom-filtered 3.6 patch gets nearly a 9x speed-up over the comparatively slower stock 3.6 codebase.
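[Editorial sketch of the one interception described above, as a fragment of the wrapping TermsEnum. FuzzySet.ContainsResult matches the committed 4.x enum; filter and delegateTermsEnum are illustrative field names.]

{code}
// Only seekExact consults the bloom filter; a definite NO skips the
// delegate's term-dictionary/disk seek entirely. Everything else delegates.
@Override
public boolean seekExact(BytesRef text, boolean useCache) throws IOException {
  if (filter.contains(text) == FuzzySet.ContainsResult.NO) {
    return false; // definitely not in this segment
  }
  return delegateTermsEnum.seekExact(text, useCache);
}
{code}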
[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395857#comment-13395857 ]

Mark Harwood commented on LUCENE-4069:
--------------------------------------

bq. I think the fix is simple: you are not overriding Terms.intersect now, in BloomFilteredTerms

Good catch - a quick test indeed shows a speed-up on fuzzy queries. I'll prepare a new patch.

I'm not sure why 3.6+Bloom is faster than 4+Bloom in my tests. I'll take a closer look at your benchmark.
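[Editorial sketch of the fix discussed above, against the Lucene 4.x Terms API; delegateTerms is an illustrative field name.]

{code}
// intersect() drives multi-term queries (fuzzy, regex) over many terms, so a
// per-term bloom check only adds overhead - forward it straight through.
@Override
public TermsEnum intersect(CompiledAutomaton compiled, BytesRef startTerm)
    throws IOException {
  return delegateTerms.intersect(compiled, startTerm);
}
{code}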