ported lucandra: lucene index on HBase
Hi, Lucandra stores a lucene index on cassandra: http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend As the author of lucandra writes: I’m sure something similar could be built on hbase. So here it is: http://github.com/thkoch2001/lucehbase This is only a first prototype which has not been tested on anything real yet. But if you're interested, please join me to get it production ready! I propose to keep this thread on hbase-user and java-dev only. Would it make sense to aim this project to become an hbase contrib? Or a lucene contrib? Best regards, Thomas Koch, http://www.koch.ro - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849639#action_12849639 ]
Michael McCandless commented on LUCENE-2215:
This is a neat collector! I like the idea of chaining/filtering... couldn't we put this in core (under TFC/TSDC.create), but instead of doubling the 12 specialized (anonymous) impls we now have, just delegate? Ie, we'd make a FilteredCollector, taking another collector when it's created, and then on every collect call, only if the hit is weak enough (ie is worse than what the app provided as prev low score/doc) would it forward it to the delegate?
I guess we should test perf w/ (the new additions to benchmark -- yay!) to see if specializing the code (even anonymously) is warranted.
The indent whitespace needs to be fixed to 2 spaces...
paging collector
Key: LUCENE-2215
URL: https://issues.apache.org/jira/browse/LUCENE-2215
Project: Lucene - Java
Issue Type: New Feature
Components: Search
Affects Versions: 2.4, 3.0
Reporter: Adam Heinz
Assignee: Grant Ingersoll
Priority: Minor
Attachments: IterablePaging.java, LUCENE-2215.patch, PagingCollector.java, TestingPagingCollector.java
http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898
Somebody assign this to Aaron McCurry and we'll see if we can get enough votes on this issue to convince him to upload his patch. :)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
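The delegation idea in the comment above can be sketched roughly as follows. This is only an illustration, not the attached patch: HitCollector is a hypothetical stand-in for Lucene's Collector, PagingFilterCollector is an invented name, and the tie-break (equal scores ordered by ascending doc id) is an assumption about how the previous page was sorted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's Collector; not the real API.
interface HitCollector {
    void collect(int doc, float score);
}

// Forwards a hit to the delegate only if it ranks strictly after the
// last hit of the previous page (the "prev low score/doc").
final class PagingFilterCollector implements HitCollector {
    private final HitCollector delegate;
    private final float prevLowScore; // lowest score on the previous page
    private final int prevLowDoc;     // doc id of that hit, used to break ties

    PagingFilterCollector(HitCollector delegate, float prevLowScore, int prevLowDoc) {
        this.delegate = delegate;
        this.prevLowScore = prevLowScore;
        this.prevLowDoc = prevLowDoc;
    }

    @Override
    public void collect(int doc, float score) {
        // Assumed order: higher score first; on equal scores, lower doc id first.
        boolean weaker = score < prevLowScore
                || (score == prevLowScore && doc > prevLowDoc);
        if (weaker) {
            delegate.collect(doc, score);
        }
    }
}

final class PagingDemo {
    // Collects the doc ids that survive the filter for a made-up hit stream.
    static List<Integer> secondPage() {
        List<Integer> seen = new ArrayList<>();
        HitCollector page = new PagingFilterCollector((doc, score) -> seen.add(doc), 2.0f, 5);
        page.collect(3, 3.0f); // stronger than the previous low: dropped
        page.collect(5, 2.0f); // the previous page's last hit itself: dropped
        page.collect(7, 2.0f); // same score, later doc: forwarded
        page.collect(1, 1.5f); // weaker score: forwarded
        page.collect(2, 2.0f); // same score, earlier doc: dropped
        return seen;
    }
}
```

The filter adds one comparison per hit only for searches that opt into paging, which is the trade-off the comment suggests measuring with the benchmark package.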
Re: Baby steps towards making Lucene's scoring more flexible...
On Mon, Mar 22, 2010 at 12:45 PM, Marvin Humphrey mar...@rectangular.com wrote:
On Thu, Mar 18, 2010 at 05:16:23AM -0500, Michael McCandless wrote:

Also, will Lucy store the original stats?

These?
* Total number of tokens in the field.
* Number of unique terms in the field.
* Doc boost.
* Field boost.

Also sum(tf). Robert can generate more :)

That would depend on which Similarity the user specs for that field. In other words, it's just another data-reduction decision: if the Sim needs it, keep it, and if it doesn't, throw it away.

OK.

Incidentally, what are you planning to do about field boost if it's not always 1.0? Are you going to store full 32-bit floats?

For starters, yes. We may (later) want to make a new attr that sets the #bits (levels/precision) you want... then uses packed ints to encode.

Ie so the chosen Sim can properly recompute all boost bytes (if it uses those), for scoring models that pivot based on avg's of these stats?

Yes, we could support that. It's not high on my todo-list for core Lucy, though: poor payoff for all the complexity it would introduce, particularly file format complexity with its heavy backwards compatibility burden. Right now, we only have the boost bytes, and the fact that they are used for length normalization, field boost, and doc boost is incidental. If we add all the raw stats, that's a bunch of stuff we have to support for a long time, yet which doesn't yield practical advantages for us yet. I'd be much more interested in finding a way to support such a feature as an extension.

I was specifically asking if Lucy will allow the user to force true average to be recomputed, ie, at commit time from the writer. It's more costly and often not needed (ie, once your index is large enough, new docs typically won't shift the average much). But I imagine some users will want true average.

In any case, the proposal to start delaying Sim choice to search-time -- while a nice feature for Lucene -- is a non-starter for Lucy. We can't do that because it would kill the cheap-Searcher model to generate boost bytes at Searcher construction time and cache them within the object. We need those boost bytes written to disk so we can mmap them and share them amongst many cheap Searchers.

It'd seem like Lucy could re-gen the boost bytes if a different Sim were selected, or, the current Sim hadn't yet computed & cached its bytes? But then logically this means a reader needs write permission to the index dir, which is not good...

Whatever's reading the boost bytes can't tell the difference between process RAM and mmap'd RAM, so write-permission on the index dir isn't required.

Hmm if you could somehow soften this... so that a custom Sim could regen its boost bytes (if it needed to), write them into the index, and then whoever's reading can mmap... that'd buy you some flexibility back.

What's trickier is that Schemas are not normally mutable, and that they are part of the index. You don't have to supply an Analyzer, or a Similarity, or anything else when opening a Searcher -- you just provide the location of the index, and the Schema gets deserialized from the latest schema_NNN.json file. That has many advantages, e.g. inadvertent Analyzer conflicts are pretty much a thing of the past for us.

That's nice... though... is it too rigid? Do users even want to pick a different analyzer at search time?

But it makes your feature request of runtime settability for Similarity awkward to implement: by the time you have a Schema object to work with, the Searcher is already open.

    Searcher searcher = new Searcher("/path/to/index");
    Schema schema = searcher.getSchema();
    schema.setSim("content", altSim); // Too late, and not implemented anyway.

I see...

To my mind, these are all related data reduction tasks:
* Omit doc-boost and field-boost, replacing them with a single float docXfield multiplier -- because you never need doc-boost on its own.
* Omit length-in-tokens, term-cardinality, doc-boost, and field-boost, replacing them all with a single boost byte -- because for the kind of scoring you want to do, you don't need all those raw stats.
* Omit the boost byte, because you don't need to do scoring at all.
* Omit positions because you don't need PhraseQueries, etc. to match.

I wouldn't group this one with the others -- I mean technically it is data reduction -- but omitting positions means certain queries (PhraseQuery) won't work even in match only searching. Whereas the rest of these examples affect how scoring is done (or whether it's done).

Couldn't disagree more. Omitting positions is *exactly* the kind of data reduction task which we know is safe to perform when a user specifically tells us they don't need PhraseQueries by specifying a MinimalSimilarity.

Hmmm... it just seems to be different categories to me. One category prevents certain kinds of queries
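The "if the Sim needs it, keep it, otherwise throw it away" rule from this exchange can be illustrated with a small sketch. Nothing here is Lucy or Lucene API: FieldStat, StatAwareSimilarity, and StatReducer are hypothetical names, and a real implementation would work on index files rather than a map.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

// The raw per-field stats listed in the thread.
enum FieldStat { TOKEN_COUNT, UNIQUE_TERMS, DOC_BOOST, FIELD_BOOST }

// Hypothetical: a Similarity that declares which raw stats it needs.
interface StatAwareSimilarity {
    EnumSet<FieldStat> requiredStats();
}

final class StatReducer {
    // Keeps only the stats the chosen Similarity asked for; the rest are dropped.
    static Map<FieldStat, Float> reduce(Map<FieldStat, Float> raw, StatAwareSimilarity sim) {
        Map<FieldStat, Float> kept = new EnumMap<>(FieldStat.class);
        for (FieldStat stat : sim.requiredStats()) {
            Float value = raw.get(stat);
            if (value != null) {
                kept.put(stat, value);
            }
        }
        return kept;
    }
}
```

A match-only Similarity in this sketch would return EnumSet.noneOf(FieldStat.class), so nothing is retained for scoring. The sketch also shows why the choice is hard to defer to search time: once a stat has been dropped at index time it cannot be recovered.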
[jira] Updated: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Smith updated LUCENE-2345:
Attachment: LUCENE-2345_3.0.patch
Here's a patch against 3.0 that provides the SegmentReaderFactory ability (not tested yet, but I'll be doing that shortly as I integrate this functionality).
It adds a SegmentReaderFactory.
The IndexWriter now has a getter and setter for setting this.
SegmentReader has a new protected method init() which is called after the segment reader has been initialized (to allow subclasses to hook this action and do additional initialization, etc).
Added 2 new IndexReader.open() calls that allow specifying the SegmentReaderFactory.
Make it possible to subclass SegmentReader
Key: LUCENE-2345
URL: https://issues.apache.org/jira/browse/LUCENE-2345
Project: Lucene - Java
Issue Type: Wish
Components: Index
Reporter: Tim Smith
Fix For: 3.1
Attachments: LUCENE-2345_3.0.patch
I would like the ability to subclass SegmentReader for numerous reasons:
* to capture initialization/close events
* attach custom objects to an instance of a segment reader (caches, statistics, so on and so forth)
* override methods on segment reader as needed
Currently this isn't really possible.
I propose adding a SegmentReaderFactory that would allow creating custom subclasses of SegmentReader. The default implementation would be something like:
{code}
public class SegmentReaderFactory {
  public SegmentReader get(boolean readOnly) {
    return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
  }

  public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
    // default: create a fresh reader of the requested kind
    return get(readOnly);
  }
}
{code}
It would then be made possible to pass a SegmentReaderFactory to IndexWriter (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, etc).
I could prepare a patch if others think this has merit.
Obviously, this API would be experimental/advanced/will change in future.
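To make the motivation concrete, here is a rough sketch of how an application might use such a factory to attach a cache to each segment reader and count how many were opened. The SegmentReader, ReadOnlySegmentReader, and SegmentReaderFactory classes below are empty stand-ins so the example compiles on its own; they are not the real Lucene classes, and CachingSegmentReader and CachingSegmentReaderFactory are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-ins for the Lucene classes, reduced to what the sketch needs.
class SegmentReader {}
class ReadOnlySegmentReader extends SegmentReader {}

// Default factory, following the shape proposed in the issue.
class SegmentReaderFactory {
    public SegmentReader get(boolean readOnly) {
        return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
    }

    public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
        return get(readOnly);
    }
}

// Hypothetical subclass carrying a per-segment cache.
class CachingSegmentReader extends ReadOnlySegmentReader {
    final Map<String, Object> cache = new HashMap<>();
}

// Hypothetical factory that hands out the subclass and records open events.
class CachingSegmentReaderFactory extends SegmentReaderFactory {
    final AtomicInteger opened = new AtomicInteger();

    @Override
    public SegmentReader get(boolean readOnly) {
        if (!readOnly) {
            return super.get(false); // writable readers stay plain in this sketch
        }
        opened.incrementAndGet();
        return new CachingSegmentReader();
    }
}
```

Because reopen() delegates to get() in the default factory, overriding get() alone is enough for reopened readers to be of the custom type as well.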
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849728#action_12849728 ] Shai Erera commented on LUCENE-2345: bq. The IndexWriter now has a getter and setter for setting this If this is not expected to change during the lifetime of IW, I think it should be added to IWC when you upgrade the patch to 3.1.
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849731#action_12849731 ] Tim Smith commented on LUCENE-2345: that was my plan
Re: Baby steps towards making Lucene's scoring more flexible...
On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:

Also, will Lucy store the original stats?

These?
* Total number of tokens in the field.
* Number of unique terms in the field.
* Doc boost.
* Field boost.

Also sum(tf). Robert can generate more :)

Hmm, aren't "Total number of tokens in the field" and sum(tf) normally equivalent? I guess there might be analyzers for which that isn't true, e.g. those which perform synonym-injection? In any case, sum(tf) is probably a better definition, because it makes no ancillary claims...

Incidentally, what are you planning to do about field boost if it's not always 1.0? Are you going to store full 32-bit floats?

For starters, yes.

OK, how are those going to be encoded? IEEE 754? Big-endian?

http://en.wikipedia.org/wiki/Endianness#Floating-point_and_endianness

We may (later) want to make a new attr that sets the #bits (levels/precision) you want... then uses packed ints to encode.

I'm concerned that the bit-wise entropy of floats may make them a poor match for compression via packed ints. We'll probably get a compressed representation which is larger than the original. Are there any standard algorithms out there for compressing IEEE 754 floats? RLE works, but only with certain data patterns.

... [ time passes ] ...

Hmm, maybe not:

http://stackoverflow.com/questions/2238754/compression-algorithm-for-ieee-754-data

I was specifically asking if Lucy will allow the user to force true average to be recomputed, ie, at commit time from the writer.

That's theoretically possible. We'd have to implement the reader the same way we have DeletionsReader -- the most recent segment may contain data which applies to older segments. Here's the DeletionsReader code, which searches backwards through the segments looking for a particular file:

    /* Start with deletions files in the most recently added segments and work
     * backwards. The first one we find which addresses our segment is the
     * one we need. */
    for (i = VA_Get_Size(segments) - 1; i >= 0; i--) {
        Segment *other_seg = (Segment*)VA_Fetch(segments, i);
        Hash *metadata
            = (Hash*)Seg_Fetch_Metadata_Str(other_seg, "deletions", 9);
        if (metadata) {
            Hash *files = (Hash*)CERTIFY(
                Hash_Fetch_Str(metadata, "files", 5), HASH);
            Hash *seg_files_data
                = (Hash*)Hash_Fetch(files, (Obj*)my_seg_name);
            if (seg_files_data) {
                Obj *count = (Obj*)CERTIFY(
                    Hash_Fetch_Str(seg_files_data, "count", 5), OBJ);
                del_count = (i32_t)Obj_To_I64(count);
                del_file = (CharBuf*)CERTIFY(
                    Hash_Fetch_Str(seg_files_data, "filename", 8), CHARBUF);
                break;
            }
        }
    }

What we'd do is write the regenerated boost bytes for *all* segments to the most recent segment. It would be roughly analogous to building up an NRT reader.

What's trickier is that Schemas are not normally mutable, and that they are part of the index. You don't have to supply an Analyzer, or a Similarity, or anything else when opening a Searcher -- you just provide the location of the index, and the Schema gets deserialized from the latest schema_NNN.json file. That has many advantages, e.g. inadvertent Analyzer conflicts are pretty much a thing of the past for us.

That's nice... though... is it too rigid? Do users even want to pick a different analyzer at search time?

It's not common. To my mind, the way a field is tokenized is part of its field definition, thus the Analyzer is part of the field definition, thus the analyzer is part of the schema and needs to be stored with the index.

Still, we support different Analyzers at search time by way of QueryParser. QueryParser's constructor requires a Schema, but also accepts an optional Analyzer which if supplied will be used instead of the Analyzers from the Schema.

Maybe aggressive automatic data-reduction makes more sense in the context of flexible matching, which is more expansive than flexible scoring?

I think so. Maybe it shouldn't be called a Similarity (which to me (though, carrying a heavy curse of knowledge burden...) means scoring)? Matcher?

Heh. Matcher is taken. It's a crucial class, too, roughly combining the roles of Lucene's Scorer and DocIDSetIterator.

The first alternative that comes to mind is Relevance, because not only can one thing's relevance to another be continuously variable (i.e. score), it can also be binary: relevant/not-relevant (i.e. match).

But I don't see why Relevance, Matcher, or anything else would be so much better than Similarity. I think this is your hang up. ;)

I'm +0 (FWIW) on search-time Sim settability for Lucene. It's a nice feature, but I don't think we've worked out all the problems yet. If we can, I might switch to +1 (FWIW).

What
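On the encoding question raised in this message ("IEEE 754? Big-endian?"), one possible answer can be sketched in Java, whose float type is an IEEE 754 32-bit value. This is an assumption about one workable on-disk layout, not what Lucene or Lucy actually settled on; BoostCodec is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class BoostCodec {
    // Writes the IEEE 754 bit pattern of the boost as 4 bytes, most significant first.
    static byte[] encode(float boost) {
        return ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putFloat(boost).array();
    }

    // Reads the 4 bytes back with the same byte order.
    static float decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getFloat();
    }
}
```

For a boost of 1.0f the encoded bytes are 3F 80 00 00. Fixing the byte order explicitly keeps the file readable regardless of the host's native endianness, which matters if the bytes are to be shared through mmap by readers on other platforms.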
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849806#action_12849806 ]
Jason Rutherglen commented on LUCENE-2324:
Michael, I'm guessing this patch needs to be updated as per LUCENE-2329?
Per thread DocumentsWriters that write their own private segments
Key: LUCENE-2324
URL: https://issues.apache.org/jira/browse/LUCENE-2324
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 3.1
Attachments: lucene-2324-no-pooling.patch
See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293:
Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
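The "N fully independent RAM segments" idea from the issue summary can be sketched in miniature as below. This is a toy model under stated assumptions, not the eventual Lucene design: documents are plain strings, a "flush" just moves the buffered docs into a list of finished segments, and the names RamSegment and IsolatedBufferWriter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// One in-memory segment; only touched while holding its own lock.
final class RamSegment {
    final List<String> docs = new ArrayList<>();
}

final class IsolatedBufferWriter {
    private final RamSegment[] segments;
    private final int maxBufferedDocs;
    private final AtomicInteger nextSlot = new AtomicInteger();
    private final ThreadLocal<Integer> slot;
    private final List<List<String>> flushed = new ArrayList<>();

    IsolatedBufferWriter(int segmentCount, int maxBufferedDocs) {
        this.segments = new RamSegment[segmentCount];
        for (int i = 0; i < segmentCount; i++) {
            segments[i] = new RamSegment();
        }
        this.maxBufferedDocs = maxBufferedDocs;
        // Each thread is bound to one RAM segment the first time it indexes.
        this.slot = ThreadLocal.withInitial(
                () -> Math.floorMod(nextSlot.getAndIncrement(), segmentCount));
    }

    void addDocument(String doc) {
        RamSegment seg = segments[slot.get()];
        List<String> full = null;
        synchronized (seg) {
            seg.docs.add(doc);
            if (seg.docs.size() >= maxBufferedDocs) {
                // This segment flushes on its own; the other segments are not blocked.
                full = new ArrayList<>(seg.docs);
                seg.docs.clear();
            }
        }
        if (full != null) {
            synchronized (flushed) {
                flushed.add(full);
            }
        }
    }

    int flushedSegmentCount() {
        synchronized (flushed) {
            return flushed.size();
        }
    }

    int bufferedDocCount() {
        int total = 0;
        for (RamSegment seg : segments) {
            synchronized (seg) {
                total += seg.docs.size();
            }
        }
        return total;
    }
}
```

The point of the per-segment lock is that a thread flushing one buffer never stalls threads writing into the others, which is the concurrency gain the summary describes.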
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849808#action_12849808 ] Jason Rutherglen commented on LUCENE-2324: Actually, I just browsed the patch again, I don't think it implements private doc writers as of yet? I think you're right, we can get this issue completed. LUCENE-2312's path looks clear at this point. Shall I take a whack at it?
[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-2324: Attachment: (was: lucene-2324-no-pooling.patch)
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849819#action_12849819 ] Michael Busch commented on LUCENE-2324: Hey Jason, Disregard my patch here. I just experimented with removal of pooling, but then did LUCENE-2329 instead. TermsHash and TermsHashPerThread are now much simpler, because all the pooling code is gone after 2329 was committed. Should make it a little easier to get this patch done. Sure it'd be awesome if you could provide a patch here. I can help you, we should just frequently post patches here so that we don't both work on the same areas.
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849843#action_12849843 ] Grant Ingersoll commented on LUCENE-2215: Mike, don't you think, though, that through a fairly simple update of some of the clauses to appropriate short circuit things that we can just hook this into the existing collectors w/o no need for any delegation or changes? Let me try a patch. Now that the benchmark stuff is in, we should be able to test.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849844#action_12849844 ] Jason Rutherglen commented on LUCENE-2324: Michael, I'm working on a patch and will post one (hopefully) shortly.
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849851#action_12849851 ] Uwe Schindler commented on LUCENE-2215: Hey, and I want to fix the NaN thing in TSDC: LUCENE-2271 Maybe when we delegate, we can also use my cool code that switches the delegate to remove on comparison after the queue is full.
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849863#action_12849863 ] Michael McCandless commented on LUCENE-2215: bq. ...through a fairly simple update of some of the clauses to appropriate short circuit things that we can just hook this into the existing collectors w/o no need for any delegation or changes? Let me try a patch. Now that the benchmark stuff is in, we should be able to test. This'd make me nervous... Ie I don't think we should insert bytecodes for the 99.9% of searches that wouldn't make use of this, even if we can't uncover a slowdown with benchmarking. We should still benchmark it though (I'm curious)... we should also benchmark the delegate solution.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849899#action_12849899 ] Michael Busch commented on LUCENE-2324: Awesome!
[jira] Created: (LUCENE-2346) Explore other in-memory postinglist formats for realtime search
Explore other in-memory postinglist formats for realtime search Key: LUCENE-2346 URL: https://issues.apache.org/jira/browse/LUCENE-2346 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.1

The current in-memory posting list format might not be optimal for searching. VInt decoding performance and the lack of skip lists would arguably be the biggest bottlenecks. For LUCENE-2312 we should investigate other formats. Some ideas:
- PFOR or packed ints for posting slices?
- Maybe even int[] slices instead of byte slices? This would be great for search performance, but the additional memory overhead might not be acceptable.
- For realtime search it's usually desirable to evaluate the most recent documents first. So using backward pointers instead of forward pointers and having the postinglist pointer point to the most recent docID in a list is something to consider.
- Skipping: if we use fixed-length postings ([packed] ints) we can do binary search within a slice. We can then also locate a pointer without scanning and thus skip entire slices quickly. Is that sufficient or would we need more skipping layers, so that it's possible to skip directly to particular slices?

It would be awesome to find a format that doesn't slow down normal indexing, but is very efficient for in-memory searches. If we can't find such a fits-all format, we should have a separate indexing chain for real-time indexing.
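To make the skipping idea in LUCENE-2346 concrete, here is a minimal sketch of a binary search over one fixed-length slice. It assumes the slice is a plain int[] of ascending docIDs; the class and method names (IntSliceSkipper, advance) are hypothetical and this is not Lucene's actual in-memory postings code.

```java
/** Hypothetical illustration only; not Lucene's in-memory postings format. */
final class IntSliceSkipper {

    /**
     * Returns the index of the first docID >= target within slice[from, to),
     * or 'to' if every docID in that range is smaller than target.
     * Assumes the docIDs in the range are sorted in ascending order.
     */
    static int advance(int[] slice, int from, int to, int target) {
        int lo = from;
        int hi = to;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;      // overflow-safe midpoint
            if (slice[mid] < target) {
                lo = mid + 1;               // target lies to the right of mid
            } else {
                hi = mid;                   // mid is a candidate, keep it in range
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] docIds = {2, 5, 9, 14, 21, 30};
        // First docID >= 10 is 14, which sits at index 3.
        System.out.println(advance(docIds, 0, docIds.length, 10));
    }
}
```

With variable-length VInt entries this lookup is not possible without decoding from the start of the slice, which is the trade-off against memory overhead that the issue raises.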
[jira] Created: (LUCENE-2347) Dump WordNet to SOLR Synonym format
Dump WordNet to SOLR Synonym format Key: LUCENE-2347 URL: https://issues.apache.org/jira/browse/LUCENE-2347 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Affects Versions: 3.0.1 Reporter: Bill Bell

This enhancement allows you to dump v2 of WordNet to SOLR synonym format! Get all your syns loaded easily.
1. You can load all synonyms from WordNet V2 (http://wordnetcode.princeton.edu/2.0/) into SOLR by starting from the Syns2Index program: http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/wordnet/Syns2Index.html Get WNprolog from http://wordnetcode.princeton.edu/2.0/
2. We modified this program to work with SOLR (see attached) on amidev.kaango.com in /vol/src/lucene/contrib/wordnet: vi /vol/src/lucene/contrib/wordnet/src/java/org/apache/lucene/wordnet/Syns2Solr.java
3. Run ant
4. java -classpath /vol/src/lucene/build/contrib/wordnet/lucene-wordnet-3.1-dev.jar org.apache.lucene.wordnet.Syns2Solr prolog/wn_s.pl solr index_synonyms.txt
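For readers who want a feel for the conversion, the following is a rough sketch and not the attached Syns2Solr.java. It assumes wn_s.pl contains facts of the form s(synset_id,w_num,'word',ss_type,sense_number,tag_count). and that the target is the comma-separated equivalence format of a Solr synonyms file. The class name SynsetGrouper and the sample lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: groups WordNet prolog "s" facts by synset and emits one line per group. */
final class SynsetGrouper {

    // Assumed input shape, e.g.: s(100001740,1,'entity',n,1,11).
    private static final Pattern S_LINE =
            Pattern.compile("^s\\((\\d+),\\d+,'(.*)',[a-z],\\d+,\\d+\\)\\.$");

    static List<String> toSolrLines(List<String> prologLines) {
        // Preserve the order in which synsets first appear.
        Map<String, List<String>> bySynset = new LinkedHashMap<>();
        for (String line : prologLines) {
            Matcher m = S_LINE.matcher(line.trim());
            if (!m.matches()) {
                continue; // ignore anything that is not an "s" fact
            }
            // Prolog escapes a single quote inside an atom by doubling it.
            String word = m.group(2).replace("''", "'").toLowerCase(Locale.ROOT);
            bySynset.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(word);
        }
        List<String> out = new ArrayList<>();
        for (List<String> words : bySynset.values()) {
            if (words.size() > 1) {          // a synset with one word has no synonyms
                out.add(String.join(",", words));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "s(100001740,1,'entity',n,1,11).",
                "s(100001740,2,'something',n,1,0).",
                "s(100002056,1,'thing',n,12,0).");
        toSolrLines(sample).forEach(System.out::println);
    }
}
```

The sketch does not escape commas inside a term and does not merge a word that appears in several synsets; a real converter would need to decide how to handle both.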
[jira] Updated: (LUCENE-2347) Dump WordNet to SOLR Synonym format
[ https://issues.apache.org/jira/browse/LUCENE-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Bell updated LUCENE-2347:

Attachment: Syns2Solr.java

Dump WordNet to SOLR Synonym format Key: LUCENE-2347 URL: https://issues.apache.org/jira/browse/LUCENE-2347 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Affects Versions: 3.0.1 Reporter: Bill Bell Attachments: Syns2Solr.java

This enhancement allows you to dump v2 of WordNet to SOLR synonym format! Get all your syns loaded easily.
1. You can load all synonyms from WordNet V2 (http://wordnetcode.princeton.edu/2.0/) into SOLR by starting from the Syns2Index program: http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/wordnet/Syns2Index.html Get WNprolog from http://wordnetcode.princeton.edu/2.0/
2. We modified this program to work with SOLR (see attached) on amidev.kaango.com in /vol/src/lucene/contrib/wordnet: vi /vol/src/lucene/contrib/wordnet/src/java/org/apache/lucene/wordnet/Syns2Solr.java
3. Run ant
4. java -classpath /vol/src/lucene/build/contrib/wordnet/lucene-wordnet-3.1-dev.jar org.apache.lucene.wordnet.Syns2Solr prolog/wn_s.pl solr index_synonyms.txt
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849961#action_12849961 ] Grant Ingersoll commented on LUCENE-2215:

Yeah, but one could make the argument, Mike, that the existing optimizations are useless for the most common case, since I think it's safe to say most applications implement paging. Of course, that being said, most users don't page all that deeply. Also, for something like Solr that prefetches the top 50 it might not be good, either. Still, in my mind it is one additional boolean check, as in:
{code}
if ( (current stuff) || (pagingInfoPresent == true && paging check) ) ...
{code}
pagingInfoPresent can be determined at construction time and that whole clause would be short circuited very quickly. That being said, delegation could be done at construction time, too, and more cleanly separates things. I'll try to put up my version tomorrow.

paging collector Key: LUCENE-2215 URL: https://issues.apache.org/jira/browse/LUCENE-2215 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.4, 3.0 Reporter: Adam Heinz Assignee: Grant Ingersoll Priority: Minor Attachments: IterablePaging.java, LUCENE-2215.patch, PagingCollector.java, TestingPagingCollector.java

http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898 Somebody assign this to Aaron McCurry and we'll see if we can get enough votes on this issue to convince him to upload his patch. :)
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849965#action_12849965 ] Jason Rutherglen commented on LUCENE-2324:

I'm a little confused by the flushedDocCount and remap-deletes conversion portions of DocWriter. flushedDocCount is used as a global counter; however, when we move to per thread doc writers, it won't be global anymore. Is there a different (easier) way to perform remap deletes?

Per thread DocumentsWriters that write their own private segments Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.1

See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850002#action_12850002 ] Shai Erera commented on LUCENE-2215:

bq. since I think it's safe to say most applications implement paging

Let's be careful about the semantics here, Grant. Most if not all applications do implement paging, but I believe only a FEW actually store user context between searches. PagingCollector relies on the application to store the lowest ranking doc that was returned previously, which means storing context between a user's searches. I agree w/ Mike's statement that 99.9% of the searches would never run that code, which is why I've proposed a delegation/wrapper approach from the beginning. I also think that we should make some allowances here and there for the non-common case, and prefer better software design over specialized code. A Collector filter approach for some rare (or even less common) cases seems very reasonable to me.

Also, I think that if we add to TSDC a create method which takes into account the previously scored lowest doc, it will confuse people. Now they will need to think "where do I get this low score from?" Perhaps after I see the code it won't seem such a bad thing; I just have a feeling TSDC and TFC should be left on their own, and extreme paging stuff should either be its own specialized collector, or a wrapper.

paging collector Key: LUCENE-2215 URL: https://issues.apache.org/jira/browse/LUCENE-2215 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.4, 3.0 Reporter: Adam Heinz Assignee: Grant Ingersoll Priority: Minor Attachments: IterablePaging.java, LUCENE-2215.patch, PagingCollector.java, TestingPagingCollector.java

http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898 Somebody assign this to Aaron McCurry and we'll see if we can get enough votes on this issue to convince him to upload his patch. :)
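The delegation/wrapper approach discussed in this thread could look roughly like the sketch below. It is deliberately not written against Lucene's real Collector API: HitCollector is a hypothetical one-method interface, and the ordering rule (higher score first, ties broken by lower doc ID) is an assumption about how the previous page was ranked.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a paging wrapper; names and interface are not Lucene's. */
final class PagingFilterDemo {

    /** Stand-in for a collector: receives one hit at a time. */
    interface HitCollector {
        void collect(int doc, float score);
    }

    /** Forwards only hits that rank strictly after the last hit of the previous page. */
    static final class PagingFilterCollector implements HitCollector {
        private final HitCollector delegate;
        private final float prevLowScore; // score of the lowest-ranked hit already shown
        private final int prevLowDoc;     // doc ID of that hit, used to break score ties

        PagingFilterCollector(HitCollector delegate, float prevLowScore, int prevLowDoc) {
            this.delegate = delegate;
            this.prevLowScore = prevLowScore;
            this.prevLowDoc = prevLowDoc;
        }

        @Override
        public void collect(int doc, float score) {
            boolean afterPrevPage = score < prevLowScore
                    || (score == prevLowScore && doc > prevLowDoc);
            if (afterPrevPage) {
                delegate.collect(doc, score); // the wrapped collector never sees earlier pages
            }
        }
    }

    /** Feeds four hits through the wrapper and returns the doc IDs that were forwarded. */
    static List<Integer> run() {
        List<Integer> forwarded = new ArrayList<>();
        HitCollector sink = (doc, score) -> forwarded.add(doc);
        HitCollector paging = new PagingFilterCollector(sink, 0.5f, 7);
        paging.collect(3, 0.9f); // ranked before the previous low hit: dropped
        paging.collect(7, 0.5f); // the previous low hit itself: dropped
        paging.collect(9, 0.5f); // same score, higher doc ID: forwarded
        paging.collect(4, 0.2f); // lower score: forwarded
        return forwarded;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The appeal of this shape is that searches without paging context never construct the wrapper, so the existing collectors keep their current code path untouched; whether the extra indirection costs anything measurable is what the proposed benchmarks would have to show.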
[jira] Created: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers Key: LUCENE-2348 URL: https://issues.apache.org/jira/browse/LUCENE-2348 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9.2 Reporter: Trejkaz DuplicateFilter currently works by building a single doc ID set, without taking into account that getDocIdSet() will be called once per segment and only with each segment's local reader.
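To illustrate the mismatch described in LUCENE-2348, the sketch below shows how doc IDs in an index-wide set relate to the segment-local IDs a per-segment call works with. It is only an illustration of the ID translation, not a fix for DuplicateFilter; the docBase/maxDoc parameters and the class name SegmentLocalBits are assumptions for the example.

```java
import java.util.BitSet;

/** Hypothetical illustration of global versus segment-local doc IDs. */
final class SegmentLocalBits {

    /**
     * Extracts the part of an index-wide bit set that belongs to one segment.
     * docBase is the global ID of the segment's first document and maxDoc is the
     * number of documents in the segment; the result uses IDs in [0, maxDoc).
     */
    static BitSet sliceForSegment(BitSet global, int docBase, int maxDoc) {
        BitSet local = new BitSet(maxDoc);
        int end = docBase + maxDoc;
        for (int g = global.nextSetBit(docBase); g >= 0 && g < end; g = global.nextSetBit(g + 1)) {
            local.set(g - docBase); // shift the global ID into the segment's own ID space
        }
        return local;
    }

    public static void main(String[] args) {
        BitSet global = new BitSet();
        global.set(1);
        global.set(4);
        global.set(5);
        global.set(9);
        // Three segments of sizes 4, 3 and 3, starting at global IDs 0, 4 and 7.
        System.out.println(sliceForSegment(global, 0, 4));
        System.out.println(sliceForSegment(global, 4, 3));
        System.out.println(sliceForSegment(global, 7, 3));
    }
}
```

A set computed once and returned unchanged for every segment would be read as if its bits were segment-local, so it would mark the wrong documents in any segment after the first.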
[jira] Updated: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
[ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trejkaz updated LUCENE-2348: Component/s: (was: Search) contrib/* Changing to contrib, only just realised it was in that location... DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers Key: LUCENE-2348 URL: https://issues.apache.org/jira/browse/LUCENE-2348 Project: Lucene - Java Issue Type: Bug Components: contrib/* Affects Versions: 2.9.2 Reporter: Trejkaz DuplicateFilter currently works by building a single doc ID set, without taking into account that getDocIdSet() will be called once per segment and only with each segment's local reader.
[jira] Commented: (LUCENE-2323) reorganize contrib modules
[ https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850012#action_12850012 ] Robert Muir commented on LUCENE-2323:

Committed 927696 (and 927697 for the solr piece). Will keep the issue open and work on a patch for the next part.

reorganize contrib modules Key: LUCENE-2323 URL: https://issues.apache.org/jira/browse/LUCENE-2323 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-2323.patch

It would be nice to reorganize contrib modules, so that they are bundled together by functionality. For example:
* the wikipedia contrib is a tokenizer, which I think really belongs in contrib/analyzers
* there are two highlighters, which I think could be one highlighters package
* there are many queryparsers and queries in different places in contrib