Re: Proposal about Version API "relaxation"
On Wed, Apr 14, 2010 at 12:49:52AM -0400, Robert Muir wrote:

> its very unnatural for release 3.0 to be almost a no-op and for release 3.1
> to provide a new default index format and support for customizing how the
> index is stored. And now we are looking at providing flexibility in scoring
> that will hopefully redefine lucene from being a vector-space search engine
> library to something much more flexible? This is a minor release?!

I agree, but what really bothers me are the X.9 releases. 2.9 changed performance characteristics dramatically enough that it was a backwards-break in all but name for many users -- most prominently, Solr [1]. Solr's FieldCache RAM requirements doubled because of the transition to per-segment search, and 2.9's backwards compatibility layer in TokenStream was significantly slower.

In my opinion, the transition to per-segment search and new-style TokenStreams should have triggered a major version break. Had that been the case, less effort could have been spent on backwards compatibility shims and fewer API design compromises would have been necessary.

To avoid such costs in the future, and to communicate disruptions in the library to users more accurately via version numbers...

* There should not be a Lucene 3.9.
* Lucene 4.0 should do more than remove deprecations.

Marvin Humphrey

[1] Thanks to Robert and Mark Miller for reminding me via IRC just what the Solr/Lucene 2.9 problems were.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API "relaxation"
On Wed, Apr 14, 2010 at 08:30:14AM -0400, Grant Ingersoll wrote:

> The thing I keep going back to is that somehow Lucene has managed for years
> (and I mean lots of years) w/o stuff like Version and all this massive back
> compatibility checking.

Non-constant global variables are an anti-pattern. Having a non-constant global determine library behavior which results in silent failure (search results degrade subtly, as opposed to e.g. an exception being thrown) is a particularly insidious anti-pattern.

In the Perl world, where modules are very heavily used thanks to CPAN, you're more likely to come across the action-at-a-distance bugs spawned by this anti-pattern. I have direct experience debugging such usage of global vars. It is extremely costly and frustrating.

For instance, there was one time when some module set the global variable $YAML::Syck::ImplicitUnicode to a true value. Whether or not that module was loaded affected how YAML::Syck's Load() function would interpret character data in completely unrelated portions of the code. As with subtly degraded search results, the result was silent failure (incorrect text stored in a database). It took many hours to hunt down what was going wrong, because the code that was causing the problem was nowhere near the code where the problem manifested. The authors of the affected code had done nothing wrong, aside from using a poorly designed module like YAML::Syck.

I am strongly opposed to using a global variable for versioning because I do not wish to impose such maddening debugging sessions on a handful of unlucky duckies who have done nothing wrong other than to choose Lucene as their search engine library.

This shouldn't be controversial. The temptations of global variables are obvious, but their flaws are well understood:

http://www.google.com/search?q=global+variables+evil

It is to be expected that the global would work most of the time. This design flaw, by nature, disproportionately afflicts a small number of users with action-at-a-distance bugs. Knowingly choosing to impose such costs on a random few is deeply unfair.

> I also am not sure whether it in the past we just missed/ignored more back
> compatibility issues or whether now we are creating more back compat. issues
> due to more rapid change.

It would be hard to search the archives to confirm my recollection, but I seem to remember back compat for Analyzers coming up every once in a while -- say, in the context of modifying StandardAnalyzer's stoplist -- and changes not being made because they would change search results.

Marvin Humphrey
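The action-at-a-distance hazard described above is easy to reproduce in miniature. A hypothetical sketch (all class names invented for illustration; this is not Lucene's actual Version API): a mutable static field in one class silently changes the behavior of code that never mentions it.

```java
// Hypothetical sketch of the action-at-a-distance bug pattern discussed
// above; the classes are invented for illustration.
class ParserConfig {
    // Non-constant global: any code anywhere in the process may flip it.
    static boolean lenient = false;
}

class Parser {
    // Behavior silently depends on global state set elsewhere.
    static int parse(String s) {
        if (ParserConfig.lenient && s.isEmpty()) {
            return 0; // silent "success" instead of an exception
        }
        return Integer.parseInt(s);
    }
}

public class GlobalStateDemo {
    public static void main(String[] args) {
        System.out.println(Parser.parse("42"));

        // Some unrelated module, loaded far away, flips the global...
        ParserConfig.lenient = true;

        // ...and now this call silently returns 0 where it used to throw.
        System.out.println(Parser.parse(""));
    }
}
```

Nothing at the second call site hints that behavior has changed; that is exactly the failure mode being objected to.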
Re: Proposal about Version API "relaxation"
On Tue, Apr 13, 2010 at 02:46:56PM -0400, Robert Muir wrote:

> Unlike other components in Lucene, Analyzers get hit the worst because any
> change is basically a break, and there's really not any other option besides
> Version to implement any backwards compatibility for them.

New class names would work, too. I only mention that for the sake of completeness, though -- it's not a suggestion.

> But things like index back compat seems kinda useless for analyzers anyway.
> If we improve them in nearly any way, you have to reindex with them to get
> the benefits.

I'm a little concerned about the issue DM Smith brought up: what happens when you have separate applications within the same JVM which have built indexes using separate versions of an Analyzer? That use case is supported under the current regime, but I'm not sure whether it would be with aggressively versioned Analyzer packages. If it's not, under what circumstances does that matter?

> I'd love to hear elaborations of any thoughts you have on how this could
> work.

Well, for Lucy, I think we may have addressed this problem with the new back compat policy we're auditioning with KS:

    KinoSearch spins off stable forks into new namespaces periodically. As of
    this release, the latest is "KinoSearch1", forked from version 0.165.
    Users who require strong backwards compatibility should use a stable
    fork.

    The main namespace, "KinoSearch", is an unstable development branch (as
    hinted at by its version number). Superficial API changes are frequent.
    Hard file format compatibility breaks which require reindexing are rare,
    as we generally try to provide continuity across multiple releases, but
    they happen every once in a while.

Essentially, we're free to break back compat within "Lucy" at any time, but we're not able to break back compat within a stable fork like "Lucy1", "Lucy2", etc. So what we'll probably do during normal development with Analyzers is just change them and note the break in the Changes file.

I doubt such a policy would be an option for Lucene, though. I think you'd have to go with separate jars for lucene-core and lucene-analyzers, possibly on independent release schedules. You'd have to bundle the broken ones with lucene-core until a major version break for bug compatibility, but the fixed ones could be distributed via lucene-analyzers concurrently.

Hmm, I suppose that doesn't work with the convention that the only difference between Lucene X.9 and Lucene Y.0 is the removal of deprecations. But if anything is crying out for a rethink in the Lucene back compat policy, IMO that's it: make major version breaks act like major version breaks and change stuff that needs changin'.

Marvin Humphrey
Re: Proposal about Version API "relaxation"
On Tue, Apr 13, 2010 at 11:17:56AM -0700, Andi Vajda wrote:

> Using global statics is flawed.

+1.

I wonder if it's possible to solve this problem for Analyzers by decoupling their distribution from the Lucene core and versioning them separately. I.e. remove MatchVersion and increment individual Analyzer version numbers instead.

This wouldn't solve the problem for good defaults elsewhere in the library. For that, I see no remedy other than more frequent major version increments.

Marvin Humphrey
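A minimal sketch of what versioning an individual Analyzer might look like, assuming an invented behaviorVersion constructor argument rather than any real Lucene API:

```java
// Sketch only: the class and its version argument are invented to
// illustrate per-Analyzer versioning, not a real Lucene interface.
class VersionedStopAnalyzer {
    private final int behaviorVersion;

    VersionedStopAnalyzer(int behaviorVersion) {
        this.behaviorVersion = behaviorVersion;
    }

    // Imagine revision 2 added "will" to the stoplist. Applications that
    // built their index against revision 1 keep constructing the analyzer
    // with 1 and see unchanged behavior -- no global state involved.
    boolean isStopWord(String token) {
        if (behaviorVersion >= 2 && token.equals("will")) {
            return true;
        }
        return token.equals("the") || token.equals("a");
    }
}
```

Because the version travels with each instance, two applications in the same JVM can use different Analyzer revisions side by side.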
Re: Changing the subject for a JIRA-issue (Was: [jira] Created: (LUCENE-2335) optimization: when sorting by field, if index has one segment and field values are not needed, do not load String[] into f
On Tue, Apr 06, 2010 at 11:26:23AM +0200, Toke Eskildsen wrote:

> The current subject and description of
> https://issues.apache.org/jira/browse/LUCENE-2335
> is obsolete due to new knowledge.
>
> Is it possible to change it? If not, what is the policy here? To open a
> new issue and close the old one?

No policy, per se. Here's my take, FWIW:

Do whatever it takes to keep the subject line of your messages in tune with the content of your messages. When they conflict, it becomes harder to follow the conversation, and our knowledge base degrades because relevant posts become harder to discover via search.

In this case, that would mean either closing this issue and opening a new one, or taking the discussion to the mailing list, where subject headers may be modified as the conversation evolves.

Marvin Humphrey
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850356#action_12850356 ]

Marvin Humphrey commented on LUCENE-2345:
-----------------------------------------

> Is there a ticket or wiki page that details the "plugin" architecture/design
> so i could take a look?

FWIW, KinoSearch has a complete prototype implementation of this design, based loosely on the mailing list conversations that Earwin referred to.

* SegReader and SegWriter are both composites with minimal APIs.
* All subcomponents subclass either DataWriter or DataReader.
* The Architecture class (under KinoSearch::Plan) determines which plugins get loaded.

[http://www.rectangular.com/svn/kinosearch/trunk/core/]

> Make it possible to subclass SegmentReader
> ------------------------------------------
>
>                 Key: LUCENE-2345
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2345
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: Index
>            Reporter: Tim Smith
>             Fix For: 3.1
>         Attachments: LUCENE-2345_3.0.patch, LUCENE-2345_3.0.plugins.patch
>
> I would like the ability to subclass SegmentReader for numerous reasons:
> * to capture initialization/close events
> * attach custom objects to an instance of a segment reader (caches,
>   statistics, so on and so forth)
> * override methods on segment reader as needed
>
> Currently this isn't really possible.
>
> I propose adding a SegmentReaderFactory that would allow creating custom
> subclasses of SegmentReader. The default implementation would be something
> like:
>
> {code}
> public class SegmentReaderFactory {
>   public SegmentReader get(boolean readOnly) {
>     return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
>   }
>
>   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
>     return new SegmentReader(readOnly);
>   }
> }
> {code}
>
> It would then be made possible to pass a SegmentReaderFactory to IndexWriter
> (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open,
> etc.).
>
> I could prepare a patch if others think this has merit.
>
> Obviously, this API would be "experimental/advanced/will change in future".

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
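To make the shape of the proposed extension point concrete, here is a self-contained sketch. The real SegmentReader cannot be constructed like this; the stub classes below are invented so that a factory subclass capturing open events (one of the wishes in the issue) is runnable.

```java
// Stub stand-ins for the real classes, invented for this sketch.
class SegmentReader {
    boolean readOnly() { return false; }
}

class ReadOnlySegmentReader extends SegmentReader {
    @Override boolean readOnly() { return true; }
}

// The factory proposed in the issue, in stub form.
class SegmentReaderFactory {
    SegmentReader get(boolean readOnly) {
        return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
    }
}

// A subclass capturing initialization events -- the kind of hook the
// issue asks for. Name and behavior are hypothetical.
class CountingSegmentReaderFactory extends SegmentReaderFactory {
    int opened = 0;

    @Override SegmentReader get(boolean readOnly) {
        opened++;
        return super.get(readOnly);
    }
}
```

An IndexWriter or DirectoryReader.open() accepting such a factory would then transparently hand out the custom subclass.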
Re: Baby steps towards making Lucene's scoring more flexible...
On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:

> > Maybe aggressive automatic data-reduction makes more sense in the context
> > of "flexible matching", which is more expansive than "flexible scoring"?
>
> I think so. Maybe it shouldn't be called a Similarity (which to me
> (though, carrying a heavy curse of knowledge burden...) means
> "scoring")? Matcher?

I think we can express the difference between your proposed approach for Lucene Similarity (no effect on index) and my proposed approach for Lucy Similarity (aggressive index-time data reduction) by putting Lucy's Similarity under Lucy::Index instead of Lucy::Search.

Marvin Humphrey
Re: Baby steps towards making Lucene's scoring more flexible...
...because not only can one thing's relevance to another be continuously variable (i.e. score), it can also be binary: relevant/not-relevant (i.e. match). But I don't see why "Relevance", "Matcher", or anything else would be so much better than "Similarity". I think this is your hang-up. ;)

> > I'm +0 (FWIW) on search-time Sim settability for Lucene. It's a nice
> > feature, but I don't think we've worked out all the problems yet. If we
> > can, I might switch to +1 (FWIW).
>
> What problems remain, for Lucene?

Storage, formatting, and compression of boosts.

I'm also concerned about making significant changes to the file format when you've indicated they're "for starters". IMO, file format changes ought to clear a higher bar than that. But I expect you to dissent on that point.

Marvin Humphrey
Re: #lucene IRC log [was: RE: lucene and solr trunk]
On Tue, Mar 23, 2010 at 01:30:42PM -0700, Otis Gospodnetic wrote:

> Archiving the logs feels like it would be useful, but realistically
> speaking, they would be pretty big and who has the time to read them after
> the fact?

Someone who participated in the chat, reviewing it while preparing a summary.

Marvin Humphrey
Re: Baby steps towards making Lucene's scoring more flexible...
...the user made when they spec'd MatchSimilarity. Saying that Lucy should keep the boost bytes under those circumstances is like saying that Lucene should outright ignore omitNorms() and always write boost bytes because users can't be trusted.

> > I meant that if you're writing out boost bytes, there's no sensible way to
> > execute the lossy data reduction and reduce the index size other than
> > having Sim do it.
>
> Right, Sim is the right class to do this. Heck one could even use
> boost nibbles... or, use float. This is an impl detail of the Sim
> class.

For Lucene, I think that makes sense, because the reduced form would be ephemeral. For Lucy, it's more complicated because the reduced data gets written to the index. Core Sim implementations should all use the same algorithm in order to minimize the complexity of the index file spec. However, it would be nice to offer an extension point enabling user-defined Sims to write non-standard formats.

> I think this all boils down to how important flexible scoring is --

Oh, please, Mike. Search-time settability for Similarity isn't the same thing as "flexible scoring". :(

Everybody thinks "flexible scoring" is important. Frankly, I think we're going to do a better job making "flexible scoring" available to our users because we're not going to make them fight through a thicket of jargon to get it.

> I'd like users to be able to try out different scoring at search
> time, even if it means "having to understand low level stuff" when
> setting their field types during indexing.
>
> You don't think flexible scoring is that important ("just reindex")
> and that it's not great to have users understand low level stats for
> indexing.

I'm +0 (FWIW) on search-time Sim settability for Lucene. It's a nice feature, but I don't think we've worked out all the problems yet. If we can, I might switch to +1 (FWIW). For Lucy, I'm -1 on search-time Sim settability, for a wide variety of reasons.

Whether to perform automatic data reduction based on Similarity choice, or to force the user to specify data reduction manually, is a separate issue.

Marvin Humphrey
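For concreteness, this is the kind of lossy quantization being discussed: Lucene's norms squeeze a float into eight bits with a tiny mantissa and exponent. The sketch below follows a 3-bit-mantissa/5-bit-exponent scheme in the spirit of Lucene's SmallFloat/Similarity.encodeNorm(); it illustrates the data reduction but is not presented as the exact shipping code.

```java
// Lossy "boost byte" codec sketch: 3-bit mantissa, 5-bit exponent,
// zero-exponent point of 15 (modeled on Lucene's SmallFloat).
public class BoostBytes {
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);            // keep sign, exponent, top 3 mantissa bits
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow: smallest representable values
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                // overflow: saturate
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    static float decode(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;                      // re-bias the exponent
        return Float.intBitsToFloat(bits);
    }
}
```

The 3-bit mantissa is exactly why nearby boosts collapse to the same byte; "boost nibbles" would simply shrink the mantissa/exponent budget further.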
Re: Baby steps towards making Lucene's scoring more flexible...
...the more important it is for them to buy optimization seminars where Lucene gurus explain all the obscure incantations to them. :)

> > You seem to be fixated on the notion of swapping in a MatchOnlySim object
> > at search time. You can't do that in KS/Lucy, because you can't modify a
> > Schema at search-time, and the per-field Similarity assignments are part
> > of the Schema. But *it doesn't matter* because you don't need a
> > MatchOnlySim to do doc-id-only postings iteration -- an
> > AllBellsAndWhistlesScoringSim can spawn a doc-id-only PostingDecoder just
> > as easily as MatchOnlySim can.
>
> I am fixated because it's a glaring example (to me) of what's wrong
> with forcing user to commit to how scoring is going to happen, at
> index time, for that field.

Haha, well that would sure suck if it didn't work! But I'm telling you it's no problem.

> And I'm still confused on how this'll work in Lucy -- if in my global
> write-once Lucy scheme I bind a field during indexing to
> AllBellsAndWhistlesScoringSim... then at search time, sure, it can
> spawn a doc-id-only PostingDecoder... so that does mean I can do
> match-only searching using that, somehow?

Of course. Lucene can't do that? No way, that can't be right! I've gotta be missing something. (Though I guess that would explain the fixation on needing a different Sim.)

Needing a special Sim for match-only seems like an absurd limitation -- I mean, the doc id data is there, and you don't need scores. You've gotta be able to fake it at least.

> (Ie I can't change the field to MatchOnlySim, but, I have some workaround
> that lets me achieve the same functionality...?)

It's not a workaround. Things just work that way. Without getting into the gory details... if you're not calculating a score, you don't need Similarity's functionality. If Lucene still needs a Sim object despite not needing its functionality, that's just an accident of the OO design, and it so happens that our "loose C" port doesn't have the same quirk.

Marvin Humphrey
Re: Baby steps towards making Lucene's scoring more flexible...
> > ...the boost bytes are baked into the index. But you can still do
> > doc-id-only posting iteration against any posting format since doc-id-only
> > is the minimum requirement for a posting list.
> >
> > So your question is predicated on the assumption that you need a
> > doc-id-only Similarity to do doc-id-only postings iteration, but that's
> > not true -- you need a doc-id-only PostingDecoder, which may be spawned by
> > any Similarity.
> >
> > Does that make sense?
>
> It sounds like... if the user had used AllBellsAndWhistlesScoringSim
> while indexing, they will still be able to use MatchOnlySim while
> searching because under-the-hood MatchOnlySim knows how to pull a
> docID only postings iterator from that field.

You seem to be fixated on the notion of swapping in a MatchOnlySim object at search time. You can't do that in KS/Lucy, because you can't modify a Schema at search-time, and the per-field Similarity assignments are part of the Schema. But *it doesn't matter* because you don't need a MatchOnlySim to do doc-id-only postings iteration -- an AllBellsAndWhistlesScoringSim can spawn a doc-id-only PostingDecoder just as easily as MatchOnlySim can.

Marvin Humphrey
[jira] Commented: (LUCENE-2316) Define clear semantics for Directory.fileLength
[ https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844906#action_12844906 ]

Marvin Humphrey commented on LUCENE-2316:
-----------------------------------------

Is it really necessary to obtain the length of a file from the Directory? Lucy doesn't implement that functionality, and we haven't missed it -- we're able to get away with using the length() method on InStream and OutStream. I see that IndexInput and IndexOutput already have length() methods. Can you simply eliminate all uses of Directory.fileLength() within core and deprecate it without introducing a new method?

> Define clear semantics for Directory.fileLength
> -----------------------------------------------
>
>                 Key: LUCENE-2316
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2316
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.1
>
> On this thread:
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e
> it was mentioned that Directory's fileLength behavior is not consistent
> between Directory implementations if the given file name does not exist.
> FSDirectory returns a 0 length while RAMDirectory throws FNFE.
>
> The problem is that the semantics of fileLength() are not defined. As
> proposed in the thread, we'll define the following semantics:
> * Returns the length of the file denoted by name if the file exists. The
>   return value may be anything between 0 and Long.MAX_VALUE.
> * Throws FileNotFoundException if the file does not exist. Note that you can
>   call dir.fileExists(name) if you are not sure whether the file exists or
>   not.
>
> For backwards compatibility we'll create a new method w/ clear semantics.
> Something like:
>
> {code}
> /**
>  * @deprecated the method will become abstract when #fileLength(name) has
>  * been removed.
>  */
> public long getFileLength(String name) throws IOException {
>   long len = fileLength(name);
>   if (len == 0 && !fileExists(name)) {
>     throw new FileNotFoundException(name);
>   }
>   return len;
> }
> {code}
>
> The first line just calls the current impl. If it throws an exception for a
> non-existing file, we're ok. The second line verifies whether a 0 length is
> for an existing file or not and throws an exception appropriately.
Re: Baby steps towards making Lucene's scoring more flexible...
...to describe how the field will be scored. Based on that info, we can customize the posting format, possibly making optimizations and omitting certain posting data.

When people ask on the user list...

    "How can I make my index smaller?"

... we can reply like so:

    "Make some fields match-only by specifying MatchSimilarity in the
    FieldType, or even better, if you don't need phrase queries, by
    specifying MinimalSimilarity. You'll be throwing away data Lucy needs
    for sophisticated queries, but your index will get smaller."

I think that response is easier to understand than a response instructing them to "enable omitNorms", and it introduces the very important Similarity class rather than the confusing, overloaded, and not-very-useful terminology, "norms".

>>>> They could use better codecs under the format-follows-Similarity model,
>>>> too. They'd just have to subclass and override the factory methods that
>>>> spawn posting encoders/decoders.
>>>
>>> Ahh, OK so that's how they'd do it.
>>>
>>> So... I think we're making a mountain out of a molehill.
>>
>> Well, I don't see it that way, because I place great value on designing
>> good public APIs, and I think it's important that we avoid forcing users
>> to know about codecs.
>
> I had thought we were bickering about whether you subclass & override
> a method (to alter the codec) (= Lucy) vs you create your own
> Codec/CodecProvider and pass that to your writer, which seems a
> minor difference.
>
> If the user is not tweaking the codec, they don't have to do anything
> with codecs (the defaults work) for either Lucy or Lucene.
>
> So the only difference is the specifics of how the codec-tweaking-user
> in fact alters the codec.

I don't think that's the only difference. What does the novice user know about "PFOR", about "pulsing", about "group varint", etc.? They aren't going to know jack. So how are you expecting them to distinguish between various Codec subclasses named after those high-falutin' concepts?

The difference is that you're forcing the novice user to learn esoteric material just to get started, while the format-follows-Sim model is trying to spare the novice yet enable the expert. Users shouldn't have to distinguish between "codecs" until they are actually ready to write their own.

As we discussed on IRC yesterday, the number of people who will be qualified to write posting codec code will still be very small, even after we finish this democratization push. It will be a big step forward if we can just get more Lucene committers to grok the inner workings of posting lists. However, there are some very useful optimizations that will be underutilized by the user base if the public API uses jargon like "omitTFAP" and "PFORCodec" that shuts out everyone except elite developers.

>> Under format-follows-Sim, it would be the Similarity object that knows all
>> supported decoding configurations for the field.
>
> I'm still hazy on how you'll know at search time which Sims are
> "congruent" with what's stored in the index, ie that downgrading to
> MatchOnlySim is allowed, but swapping to a different scoring model is
> not (because norms are committed at indexing time).

I'm not sure that e.g. TermScorer would even know what Similarity it was dealing with. It would ask for a boost-byte decoder from the sim, but it wouldn't have to know or care how the boost bytes got translated to float multipliers.

Under Lucy, you can't switch to a different weighting model at search time because the boost bytes are baked into the index. But you can still do doc-id-only posting iteration against any posting format since doc-id-only is the minimum requirement for a posting list.

So your question is predicated on the assumption that you need a doc-id-only Similarity to do doc-id-only postings iteration, but that's not true -- you need a doc-id-only PostingDecoder, which may be spawned by any Similarity.

Does that make sense?

Marvin Humphrey
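The "any Similarity can spawn a doc-id-only decoder" argument can be sketched in a few lines. All names below (PostingDecoder, Postings, AllBellsAndWhistlesScoringSim) are invented to mirror the Lucy design being described; they are not real Lucene or Lucy APIs.

```java
// Hypothetical sketch: match-only iteration never requires a special
// match-only Similarity, because doc ids are the minimum any posting
// format stores.
interface PostingDecoder {
    boolean next();
    int docId();
}

// Invented posting-list layout: doc ids plus per-doc boost bytes.
class Postings {
    final int[] docs;
    final byte[] boosts;
    Postings(int[] docs, byte[] boosts) {
        this.docs = docs;
        this.boosts = boosts;
    }
}

// Reads only the doc ids, ignoring whatever scoring data sits alongside.
class DocIdOnlyDecoder implements PostingDecoder {
    private final int[] docs;
    private int i = -1;
    DocIdOnlyDecoder(int[] docs) { this.docs = docs; }
    public boolean next() { return ++i < docs.length; }
    public int docId() { return docs[i]; }
}

// A full scoring Similarity can still hand back a doc-id-only decoder.
class AllBellsAndWhistlesScoringSim {
    PostingDecoder docIdOnlyDecoder(Postings postings) {
        return new DocIdOnlyDecoder(postings.docs);
    }
}
```

The decoder simply skips the boost bytes; no MatchOnlySim ever enters the picture.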
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844688#action_12844688 ]

Marvin Humphrey commented on LUCENE-2308:
-----------------------------------------

> Also creating a FieldType with args like
> new FieldType(true, false, false) isn't really readable. Agreed

Another option would be a "flags" integer and bitwise constants:

{code}
FieldType type = new FieldType(analyzer, FieldType.INDEXED | FieldType.STORED);
{code}

> It would be nice if we could do something similar to IndexWriterConfig
> (LUCENE-2294), where you use incremental ctor/setters to set up the
> configuration but then once it's used ("bound" to a Field), it's
> immutable.

I bet that'll be more popular than flags, but I thought it was worth bringing it up anyway. :)

> Separately specify a field's type
> ---------------------------------
>
>                 Key: LUCENE-2308
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2308
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>
> This came up from discussions on IRC. I'm summarizing here...
>
> Today when you make a Field to add to a document you can set things:
> indexed or not, stored or not, analyzed or not, details like omitTfAP,
> omitNorms, index term vectors (separately controlling offsets/positions),
> etc.
>
> I think we should factor these out into a new class (FieldType?). Then you
> could re-use this FieldType instance across multiple fields. The Field
> instance would still hold the actual value.
>
> We could then do per-field analyzers by adding a setAnalyzer on the
> FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for
> per-field codecs (with flex), where we now have PerFieldCodecWrapper).
>
> This would NOT be a schema! It's just refactoring what we already specify
> today. EG it's not serialized into the index.
>
> This has been discussed before, and I know Michael Busch opened a more
> ambitious (I think?) issue. I think this is a good first baby step. We
> could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe
> hold off on that for starters...
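The "incremental setters, then immutable once bound" idea quoted above can be sketched as a freeze-on-bind object. This is a hypothetical illustration patterned on IndexWriterConfig's approach; the class and method names are invented, not the eventual Lucene API.

```java
// Hedged sketch of a settable-until-bound FieldType. Invented names.
class FieldTypeSketch {
    private boolean indexed, stored;
    private boolean frozen;

    FieldTypeSketch setIndexed(boolean v) { checkMutable(); indexed = v; return this; }
    FieldTypeSketch setStored(boolean v)  { checkMutable(); stored = v; return this; }

    // Called when the type is bound to a Field; after this, no more edits.
    void freeze() { frozen = true; }

    private void checkMutable() {
        if (frozen) {
            throw new IllegalStateException("FieldType is bound and immutable");
        }
    }

    boolean indexed() { return indexed; }
    boolean stored()  { return stored; }
}
```

Chained setters keep construction readable (unlike positional booleans), while freezing preserves the safety of sharing one type instance across many fields.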
Re: [jira] Commented: (LUCENE-2308) Separately specify a field's type
On Fri, Mar 12, 2010 at 03:01:27PM -0500, Mark Miller wrote:

> Committers are competant in different areas of the code. Even mike
> wasn't big into the search side until per segment. Commiters are
> trusted to mess with the pieces they know.

Absolutely. I wouldn't expect every committer to understand the gory details of posting formats, and I've been a little caught off guard by the blowback from what I thought was an innocuous observation.

But by the same token, I wouldn't expect our users to have sufficient expertise to understand all the variants of omit*() either. This stuff oughtta be implementation details.

> I don't see anyone even remotely suggesting that users should have to
> understand all of the implications of posting format modifications.

That's what omitTFAP() and omitNorms() do, though. And as Mike pointed out in the "baby steps" thread, omitTFAP() is often misunderstood.

Marvin Humphrey
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844659#action_12844659 ]

Marvin Humphrey commented on LUCENE-2308:
-----------------------------------------

I'm simply suggesting that the proposed API is too hard to understand. Most users know whether their fields can be "match-only" but have no idea what TFAP is. And even advanced users will have difficulty understanding all the implications for matching and scoring when they selectively disable portions of the posting format.

I'm not a fan of omitTFAP, omitTF, omitNorms, omitPositions, or omit(flags). Something that ordinary users can grok would be used more often and more effectively.

> Separately specify a field's type
> ---------------------------------
>
>                 Key: LUCENE-2308
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2308
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844637#action_12844637 ] Marvin Humphrey commented on LUCENE-2308: - If you disable term freq, you also have to disable positions. The "freq" tells you how many positions there are. I think it's asking an awful lot of our users to require that they understand all the implications of posting format modifications when committers have difficulty mastering all the subtleties.
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844626#action_12844626 ] Marvin Humphrey commented on LUCENE-2308: - I think we might consider matchOnly() instead of omitNorms(). If a field is "match only", we don't need boost bytes a.k.a. "norms" because they are only used as a scoring multiplier. Haven't got a good synonym for "omitTFAP", but I'd sure like one.
Re: Baby steps towards making Lucene's scoring more flexible...
On Mon, Mar 08, 2010 at 02:10:35PM -0500, Michael McCandless wrote: > We ask it to give us a Codec. There's a conflict between the segment-wide role of the "Codec" class and its role as specifier for posting format. In some sense, you could argue that the "codec" reads/writes the entire index segment -- which includes not only postings files, but also stored fields, term vectors, etc. However, the compression algorithms after which these codecs are named have nothing to do with those other files. PFORCodec isn't relevant to stored fields. I'd argue for limiting the role of "Codec" to encoding and decoding posting files. As far as modularizing other aspects of index reading and writing, I don't think a simple factory is the way to go. I favor using a composite design pattern for SegWriter and SegReader (rather than subclassing), and an initialization phase controlled by an Architecture object. It was Earwin Burrfoot who persuaded me of the merits of a user-defined initialization phase over a user-defined factory method: <http://markmail.org/message/ukhcvp2ydfxpcg7q>. > So far my fav is still CodecProvider ;) It seems that the primary reason this object is needed is that IndexReader needs to be able to find the right decoder when it encounters an unfamiliar codec name. Since the core doesn't know about user-created codecs, it's necessary for the user to register the name => codec pairing in advance so that core can find it. If that's this object's main role, I'd suggest "CodecRegistry". > Naming is the hardest part!! For me, the hardest parts of API design are... A) Designing public abstract classes / interfaces. B) Compensating for the curse of knowledge. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
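A name-to-decoder registry along those lines might look like the following minimal sketch. The `CodecRegistry` and `Codec` names here are illustrative only (not actual flex APIs): the user registers the name => codec pairing in advance, and the reader looks the decoder up when it encounters that name in a segment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a codec registry: user-created codecs are
// registered by name in advance, so that an IndexReader encountering
// an unfamiliar codec name in a segment can still find its decoder.
public class CodecRegistry {
    // Placeholder for whatever the real posting-codec interface would be.
    public interface Codec {
        String getName();
    }

    private final Map<String, Codec> codecs = new HashMap<>();

    public void register(String name, Codec codec) {
        codecs.put(name, codec);
    }

    public Codec lookup(String name) {
        Codec codec = codecs.get(name);
        if (codec == null) {
            throw new IllegalArgumentException("Unknown codec: " + name);
        }
        return codec;
    }
}
```

The failure mode is then an explicit exception at open time rather than a silent misread of the postings.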
Re: Baby steps towards making Lucene's scoring more flexible...
rity object that knows all supported decoding configurations for the field. > > You don't want to use the stronger, more constrictive check, right? > > You mean single inheritance? No. Because then we hardwire the attrs > to the Codec. Standard codec should encode whatever attrs the app > hands us... I think. I might approach things the same way if Clownfish supported interface method dispatch. :) As it is, though, I'm not sure that the single inheritance requirement is an important liability. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Multi-node stats within individual nodes (was "Baby steps...")
On Tue, Mar 09, 2010 at 01:04:19PM -0500, Michael McCandless wrote: > BM25 needs the field length in tokens. lnu.ltc needs avg(tf). These > 2 stats seem to the "common" ones (according to Robert). So I want to > start with them. OK, interesting. > > I don't know that compressing the raw materials is going to work as well as > > compressing the final product. Early quantization errors get compounded > > when > > used in later calculations. > > I would not compress for starters... How about lossless compression, then? Do you need random access into this specialized posting list? For the use cases you've described so far I don't think so, since you're just iterating it top to bottom on segment open. You could store the total length of the field in tokens and the number of unique terms as integers, compressing with vbyte, PFOR or whatever... then divide at search time to get average term frequency. That way, you also avoid committing to a float encoding, which I don't think Lucene has standardized yet. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
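A sketch of that scheme (class and method names are illustrative, not a real Lucene API): write the two integers with the classic VInt variable-byte encoding, and derive average term frequency by division at search time, so no float encoding is ever committed to disk.

```java
import java.io.ByteArrayOutputStream;

// Sketch: encode total field length in tokens and unique term count
// as VInts; avg term frequency is derived at search time by division.
public class FieldStats {
    // Classic variable-byte (VInt) encoding: 7 bits per byte,
    // high bit set on all but the final byte.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Decode a VInt starting at pos[0]; pos[0] is advanced past it.
    static int readVInt(byte[] buf, int[] pos) {
        int b = buf[pos[0]++];
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    // Average term frequency computed lazily from the stored integers.
    public static float avgTermFreq(int totalTokens, int uniqueTerms) {
        return (float) totalTokens / uniqueTerms;
    }
}
```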
Re: Baby steps towards making Lucene's scoring more flexible...
On Tue, Mar 09, 2010 at 05:06:08AM -0500, Michael McCandless wrote: > > For what it's worth, that's sort of the way KS used to work: > > Schema/FieldType > > information was stored entirely in source code. That's changed and now we > > serialize the whole schema including all Analyzers, but source-code-only is > > a > > viable approach. > > Hmm but KS still somehow enforced strong typing across indexing > sessions? Nope, it wasn't enforced. > You said "of course" before but... how in your proposal could one > store all stats for a given field during indexing, but then sometimes > use match-only and sometimes full-scoring when querying against that > field? The same way that Lucene knows that sometimes it needs a docs-only-enum and sometimes it needs a docs-and-positions enum. Sometimes you need scores, sometimes you don't. > >> If user switches up their codec then they'll need to ensure it also > >> stores stats required by their Sim(s). > > > > That's backwards, IMO. > > I'm still baffled. If I wanna play a movie on my 1080P monitor I'll > need to find a movie that was encoded hidef (ie, bluray not dvd). > > I mean, I don't have to. DVD content will play fine still... just > degraded quality. Heh. Consumers hate format wars. In this case, though, we're dealing with software, not DVD hardware, so upgrading is a lot easier. Under the format-follows-Similarity model, the relationship between Similarity and posting format is more akin to the relationship between a container format like Quicktime and codecs like Sorenson 3 or H.264. Tweakers will want to go in and monkey with the choice of codec within the Quicktime file, but most users will just trust us to use the latest and greatest. > > The posting format encoding should be an implementation detail. The general > > user should be expressing their intent as far as how they want the field to > > be > > scored, and the posting format should flow from that. 
> > Maybe it's that it bothers you that with this proposed changed the > user makes 2 decisions -- Codec and Sim? Yes, and it bothers me that users have to know about codecs at all, when in the vast majority of cases it doesn't matter because the default is going to be the best choice. Since compression algorithm performance depends on knowing how to exploit patterns in the data and sometimes the user will know about patterns that are opaque to us, in some circumstances they will be able to select a more appropriate codec. But that's not the common case, as it requires both unusual data and an unusually sophisticated user. What users will be able to tell us is how they want the field to be used, and we can use that information to help us optimize. For example, when a user declares that they want a field to be "match-only", we know we don't have to write boost bytes, freq or positions, saving space. > Ie user will choose PFor or Standard or Pulsing(PFor/Standard) codec, and > then separately choose Sim? > > But these are important choices. They should be separate. Why > force-bundle them? Because most of the time the user isn't going to be able to improve on the default. > > Whether we use VInt, PFOR, group varint, hand-tuned bit shifting, etc under > > the hood to implement BM25, match-only, boost-per-position or whatever > > shouldn't be the user's concern. As time goes on, we should allow ourselves > > the flexibility to use new compression techniques to write new segments. > > But w/ the proposed change Lucene users will be free to use better > codecs? They could use better codecs under the format-follows-Similarity model, too. They'd just have to subclass and override the factory methods that spawn posting encoders/decoders. > Are you worried about proper defaulting? We'll handle that > (under Version). I don't think it's necessary or desirable to handle this with Version. 
A codec improvement (say, encoding match-only fields using PFOR instead of VInts) would simply trigger an index format number increment, and new segments would be written using the latest format. > > There's no difference between calling enum.nextPosition() and > > positions.next(), is there? > > Right now it's a 2 step process when you access via attr -- first you > ask the enum to next(), then you ask each attr associated w/ that enum > for their value. OK, I think I see where the limitation arises. In Lucy/KS, we'd just access the positions value as a member variable (direct struct access) rather than invoking a method. By default, struct definitions are opaque and thus member vars are inaccessible (to encourage loose cou
Re: Multi-node stats within individual nodes (was "Baby steps...")
On Mon, Mar 08, 2010 at 02:23:47PM -0500, Michael McCandless wrote: > For a large index the stats will be stable after re-indexing only a > few more docs. Well, not if there's been huge churn on other nodes in the interim. > No... the stat is avg tf within the doc. Don't you need the *total* field length -- not just the average tf -- for the docXfield in question to perform length normalization? Or is average term frequency within the docXfield a BM25-specific precursor that you are using as an example stat? > So if I index this doc: > > a a a a b b b c c d > > The avg(tf) = average(4 3 2 1) = 2.5. > > So we'd store 2.5 for that docXfield in a fixed-width dense postings > list (like column stride fields -- every doc has a value). Like column-stride fields, but also analogous to the current "norms" -- only with 4x the space requirements. That is, unless you compress that float down to a byte, as is currently done with the norm (3 bit mantissa, 5 bit exponent). The generation of a "norm" byte involves some pretty intense lossy data-reduction. If you're going to store the pre-data-reduction raw materials, you're going to incur a space penalty unless you can eke out similar savings somewhere. The coarse quantization is justified because we only care about big differences at search-time. If two documents are judged as reasonably close to each other in relevance, the order in which they rank isn't important. It's only when docs are judged as far apart in relevance that their relative rank order matters. I don't know that compressing the raw materials is going to work as well as compressing the final product. Early quantization errors get compounded when used in later calculations. BTW, I think we should refer to these bytes as "boost bytes" rather than "norms". Their purpose is not simply to convey length normalization; they also include document boost and field boost. And the length normalization multiplier is a kind of boost... 
so "boost byte" has everything covered, and avoids the overloading of the term "norm". Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
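The "pretty intense lossy data-reduction" referenced above (3-bit mantissa, 5-bit exponent) can be made concrete with a self-contained sketch modeled on Lucene's SmallFloat float-to-byte scheme; consult the real SmallFloat for the shipping code.

```java
// Sketch of the 3-bit-mantissa / 5-bit-exponent quantization used for
// norms a.k.a. "boost bytes", modeled on Lucene's SmallFloat.
public class BoostByte {
    // Compress a float into one byte: keep the top 3 mantissa bits and
    // rebase the exponent so 5 bits suffice (zero-exponent point 15).
    public static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: clamp to largest representable value
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Expand the byte back into a float. Many nearby floats map to the
    // same byte -- the coarse quantization discussed above.
    public static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }
}
```

The coarseness is visible directly: values as far apart as 1.0 and 1.05 collapse to the same byte, which is exactly why early quantization errors compound if reused in later calculations.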
Re: Baby steps towards making Lucene's scoring more flexible...
ldn't be the user's concern. As time goes on, we should allow ourselves the flexibility to use new compression techniques to write new segments. > > Just a thought: why not make positions an attribute on a DocsEnum? > > Maybe... though I think the double method call (enum.next() then > posAttr.get()) is too much added cost. Why wouldn't it work to have the consumer extract the positions attribute from the DocsEnum during construction? There's no difference between calling enum.nextPosition() and positions.next(), is there? Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
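The construction-time extraction being proposed can be sketched as follows. All names here are hypothetical stand-ins (not flex APIs): the consumer fetches the positions attribute once, up front, and after each next() the attribute already holds the current value, so no second method call is needed per step.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: pull the positions attribute from the enum once
// during construction, then read its updated value after each next().
public class PositionsDemo {
    public static class PositionsAttribute {
        private int current;
        void set(int p) { current = p; }
        public int get() { return current; }
    }

    public static class MockDocsEnum {
        private final Iterator<int[]> postings; // each entry: {docID, position}
        private final PositionsAttribute posAttr = new PositionsAttribute();
        private int doc = -1;

        public MockDocsEnum(List<int[]> postings) {
            this.postings = postings.iterator();
        }

        // Consumer extracts this once, at construction time.
        public PositionsAttribute getPositionsAttribute() { return posAttr; }

        public boolean next() {
            if (!postings.hasNext()) return false;
            int[] entry = postings.next();
            doc = entry[0];
            posAttr.set(entry[1]); // attr updated in place as we advance
            return true;
        }

        public int doc() { return doc; }
    }
}
```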
Re: Baby steps towards making Lucene's scoring more flexible...
"normal" postings iterator. As to whether we expose the part-of-speech via an attribute or via a method, that's up in the air. Hmm. From a class-design perspective, it would probably be best to go with an attribute, since Lucy has only single-inheritance and no interfaces. A rigid class hierarchy is going to cause problems when you need an iterator that combines unrelated concepts like BM25 weighting and part-of-speech tagging. > In flex you'd get a "normal" DocsAndPositionsEnum, pull the POS attr > up front, and as you're next'ing your way through it, optionally look > up the POS of each position you step through, using the POS attr. Just a thought: why not make positions an attribute on a DocsEnum? Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Multi-node stats within individual nodes (was "Baby steps...")
On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote: > > Fortunately, beaming field length data around is an easier problem than > > distributed IDF, because with rare exceptions, the number of fields in a > > typical index is miniscule compared to the number of terms. > > Right... so how do we control/configure when stats are fully > recomputed corpus wide hmmm. Should be fully app controllable. Hmm, at first, I don't like the sound of that. Right now, we're talking about an esoteric need for a specific plugin, BM25 similarity. The top level indexer object should be oblivious to the implementation details of plugins. However, the theme here is the need for an individual node to sync up with the distributed corpus. If you don't do that at index time, you have to do it at search time, which isn't always ideal. So I can see us building in some sort of functionality to address that more general case. It would be the flip of the MultiSearcher-comprised-of-remote-searchables situation. > > I guess you'd want to accumulate that average while building the segment... > > oh wait, ugh, deletions are going to make that really messy. :( > > > > Think about it for a sec, and see if you swing back to the desirability of > > calculation on the fly using maxDoc(), like I just did. > > I think we'd store a float (holding avg(tf) that you computed when > inverting that doc, ie, for all unique terms in the doc what's the avg > of their freqs) for every doc, in the index. Then we can regen fully > when needed right? Hmm, full regeneration would be expensive, so I'd discounted it. You'd have to iterate the entire posting list for every term, adding up freq() while skipping deleted docs. > Or maybe we store sum(tf) and #unique terms... hmm. > > Handling docs that did not have the field is a good point... but we > can assign a special value (eg 0.0, or, any negative number say) to > encode that? Where? In the full field storage? Too slow to recover. In the term dictionary? 
The term dictionary can't store nulls. You'd have to use sentinels... thus restricting the allowable content of the field?! No way. In the Lucy-style mmap'd sort cache? That would work, because we always have a "null ord", to which documents which did not supply a value for the field get assigned in the ords array. However, sort/field caches are orthogonal to this problem and we don't want to require them for an ancillary need. I suppose you could do it by iterating all posting lists for a field and flipping bits in a bit vector. The bits that are left unset correspond to docs with null values. > Deletions I think across the board will skew stats until they are > reclaimed. Yes, and unless the stats are fully regenerated when a segment with deletions gets merged away, the averages will be wrong to some degree, with the skew potentially worsening over time. Say that you have a segment with an average field length of 5 for the "tags" field, but that that average is the result of most docs having 1 tag, while a handful of docs have 100 tags. Now say you delete all of the docs with 100 tags. The recorded average for the "tags" field within the segment is now all messed up -- it should be "1", but it's "5". You have to regenerate a new, correct average when building a new segment. You can't use the existing value of "5" as a shortcut, or the consolidated segment's averages will be wrong from the get-go. That's what I was getting at earlier. However, I'd thought that we could get around the problem by fudging with maxDoc(), and I no longer believe that. I think full regeneration is the only way. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
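The "tags" example can be made concrete with a toy recomputation (illustrative code, not Lucene's): with 96 docs holding 1 tag and 4 docs holding 101 tags, the stored average is 5.0; delete the 4 long docs and only a full pass over the live documents recovers the correct average of 1.0.

```java
// Sketch of why a stored per-segment average goes stale under deletions:
// the only safe move during a merge is full regeneration from live docs.
public class AvgFieldLength {
    public static double average(int[] fieldLengths, boolean[] deleted) {
        long sum = 0;
        int liveDocs = 0;
        for (int i = 0; i < fieldLengths.length; i++) {
            if (deleted[i]) continue; // skip deleted docs entirely
            sum += fieldLengths[i];
            liveDocs++;
        }
        return (double) sum / liveDocs;
    }
}
```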
Re: Baby steps towards making Lucene's scoring more flexible...
t being to fall back to doc-id-only and discard data when an unknown posting format is encountered, I presume. > >> > Similarity is where we decode norms right now. In my opinion, it > >> > should be the Similarity object from which we specify per-field > >> > posting formats. > >> > >> I agree. > > > > Great, I'm glad we're on the same page about that. > > Actually [sorry] I'm no longer so sure I agree! > > In flex we have a separate Codec class that's responsible for > creating the necessary readers/writers. It seems like Similarity is a > consumer of these stats, but need not know what format is used to > encode them on disk? It's true that it's possible to separate out Similarity as a consumer. However, I'm also thinking about how to make this API as easy to use as possible. One rationale behind the proposed elevation of Similarity is that I'm not a fan of the name "Codec". I think it's too generic to use for the class which specifies a posting format. "PostingCodec" is better, but might be too long. In contrast, "Similarity" is more esoteric than "Codec", and thus conveys more information. For Lucy, I'm imagining a stripped-down Similarity class compared to current Lucene. It would bear the responsibility for setting policy as to how scores are calculated (in other words, judging how "similar" a document is to the query), but what information it uses to calculate that score would be left entirely open. Methods such as tf(), idf(), encodeNorm(), etc. would move to a TF/IDF-specific subclass. Here's a sampling of possible Similarity subclasses:

* MatchSimilarity          // core
* TFIDFSimilarity          // core
* LongFieldTFIDFSimilarity // contrib
* BM25Similarity           // contrib
* PartOfSpeechSimilarity   // contrib

For Lucy, Similarity would be specified as a member of a FieldType object within a Schema. 
No subclassing would be required to spec custom posting formats:

    Schema schema = new Schema();
    FullTextType bm25Type = new FullTextType(new BM25Similarity());
    schema.specField("content", bm25Type);
    schema.specField("title", bm25Type);
    StringType matchType = new StringType(new MatchSimilarity());
    schema.specField("category", matchType);

Since the Similarity instance is settable rather than generated by a factory method, that means it will have to be serialized within the schema JSON file, just like analyzers must be. I think it's important to make choosing a posting format reasonably easy. Match-only fields should be accessible to someone learning basic index tuning and optimization techniques. Actually writing posting codecs is totally different. Not many people are going to want to do that, though we should make it easy for experts. What's the flex API for specifying a custom posting format? > > What's going to be a little tricky is that you can't have just one > > Similarity.makePostingDecoder() method. Sometimes you'll want a > > match-only decoder. Sometimes you'll want positions. Sometimes > > you'll want part-of-speech id. It's more of an interface/roles > > situation than a subclass situation. > > match-only decoder is handled on flex now by asking for the DocsEnum > and then while iterating only using the .doc() (even if underlyingly > the codec spent effort decoding freq and maybe other things). > > If you want positions you get a DocsAndPositionsEnum. Right. But what happens when you want a custom codec to use BM25 weighting *and* inline a part-of-speech ID *and* use PFOR? I think we have to supply a class object or class name when asking for the enumerator, like you do with AttributeSource.

    PostingList plist = null;
    PostingListReader pListReader = segReader.fetch(PostingListReader);
    if (pListReader != null) {
        PostingsReader pReader = pListReader.fetch(field);
        if (pReader != null) {
            plist = pReader.makePostingList(klass); // e.g. PartOfSpeechPostingList
        }
    }

Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Composing posts for both JIRA and email (was a JIRA post)
(CC to lucy-dev and general, reply-to set to general) On Thu, Mar 04, 2010 at 06:18:28AM +, Shai Erera (JIRA) wrote: > (Warning, this post is long, and is easier to read in JIRA) I consume email from many of the Lucene lists, and I hate it when people force me to read stuff via JIRA. It slows me down to have to jump to all those forum web pages. I only go to the web page if there are 5 or more posts in a row on the same issue that I need to read. For what it's worth, I've worked out a few routines that make it possible to compose messages which read well in both mediums.

* Never edit your posts unless absolutely necessary. If JIRA used diffs, things would be different, but instead it sends the whole frikkin' post twice (before and after), which makes it very difficult to see what was edited. If you must edit, append an "edited:" block at the end to describe what you changed instead of just making changes inline.

* Use Firefox and the "It's All Text" plugin, which makes it possible to edit JIRA posts using an external editor such as Vim instead of typing into a textarea. <http://trac.gerf.org/itsalltext>

* After editing, use the preview button (it's a little monitor icon to the upper right of the textarea) to make sure the post looks good in JIRA.

* Use "> " for quoting instead of JIRA's "bq." and "{quote}" since JIRA's mechanisms look so crappy in email. This is easy from Vim, because rewrapping a long line (by typing "gq" from visual mode to rewrap the current selection) that starts with "> " causes "> " to be prepended to the wrapped lines.

* Use asterisk bullet lists liberally, because they look good everywhere.

* Use asterisks for *emphasis*, because that looks good everywhere.

* If you wrap lines, use a reasonably short line length. (I use 78; Mike McCandless, who also wraps lines for his Jira posts, uses a smaller number). Otherwise you'll get nasty wrapping in narrow windows, both in email clients and web browsers. 
There are still a couple compromises that don't work out well. For email, ideally you want to set off code blocks with indenting:

    int foo = 1;
    int bar = 2;

To make code look decent in JIRA, you have to wrap that with {code} tags, which unfortunately look heinous in email. Left-justifying the tags but indenting the code seems like it would be a rotten-but-salvageable compromise, as it at least sets off the tags visually rather than making them appear as though they are part of the code fragment.

{code}
    int foo = 1;
    int bar = 2;
{code}

Unfortunately, that's going to look like this in JIRA, because of a bug that strips all leading whitespace from the first line.

|------------|
|int foo;    |
|    int bar;|
|------------|

It seems that this has been fixed by Atlassian in the Confluence wiki (<http://jira.atlassian.com/browse/CONF-4548>), but the issue remains for the JIRA installation at issues.apache.org. So for now, I manually strip indentation until the whole block is flush left.

{code}
int foo = 1;
int bar = 2;
{code}

(Gag. I vastly prefer wikis that automatically apply fixed-width styling to any indented text.) One last tip for Lucy developers (and other non-Java devs). JIRA has limited syntax highlighting support -- Java, JavaScript, ActionScript, XML and SQL only -- and defaults to assuming your code is Java. In general, you want to override that and tell JIRA to use "none".

{code:none}
int foo = 1;
int bar = 2;
{code}

Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Baby steps towards making Lucene's scoring more flexible...
On Tue, Mar 02, 2010 at 05:55:44AM -0500, Michael McCandless wrote: > The problem is, these scoring models need the avg field length (in > tokens) across the entire index, to compute the norms. > > Ie, you can't do that on writing a single segment. I don't see why not. We can just move everything you're doing on Searcher open to index time, and calculate the stats and norms before writing the segment out. At search time, the only segment with valid norms would be the last one, so we'd make sure the Searcher used those. I think the fact that Lucy always writes one segment per indexing session -- as opposed to Lucene's one segment per document -- makes a difference here. Whether burning norms to disk at index time is the most efficient setup depends on the ratio of commits to searcher-opens. In a multi-node search cluster, pre-calculating norms at index-time wouldn't work well without additional communication between nodes to gather corpus-wide stats. But I suspect the same trick that works for IDF in large corpuses would work for average field length: it will tend to be stable over time, so you can update it infrequently. > So I think it must be done during searcher init. > > The most we can do is store the aggregates (eg sum of all lengths in > this segment) in the SegmentInfo -- this saves one pass on searcher > init. Logically...

    token_counts: {
        segment: {
            title: 4,
            content: 154,
        },
        all: {
            title: 98342,
            content: 2854213
        }
    }

(Would that suffice? I don't recall the gory details of BM25.) As documents get deleted, the stats will gradually drift out of sync, just like doc freq does. However, that's mitigated if you recycle segments that exceed a threshold deletion percentage on a regular basis. > The norms array will be stored in this per-field sim instance. Interesting, but that wasn't where I was thinking of putting them. Similarity objects need to be sent over the network, don't they? At least they do in KS. 
So I think we need a local per-field PostingsReader object to hold such cached data. > > The insane loose typing of fields in Lucene is going to make it a > > little tricky to implement, though. I think you just have to > > exclude fields assigned to specific similarity implementations from > > your merge-anything-to-the-lowest-common-denominator policy and > > throw exceptions when there are conflicts rather than attempt to > > resolve them. > > Our disposition on conflict (throw exception vs silently coerce) > should just match what we do today, which is to always silently > coerce. What do you do when you have to reconcile two posting codecs like this?

* doc id, freq, position, part-of-speech identifier
* doc id, boost

Do you silently drop all information except doc id? > > Similarity is where we decode norms right now. In my opinion, it > > should be the Similarity object from which we specify per-field > > posting formats. > > I agree. Great, I'm glad we're on the same page about that. > > Similarity implementation and posting format are so closely related > > that in my opinion, it makes sense to tie them. > > This confuses me -- what is stored in these stats (each field's token > length, each field's avg tf, whatever other a codec wants to add over > time...) should be decoupled from the low level format used to store > it? I don't know about that. I don't think it's necessary to decouple them. There might be some minor code duplication, but similarity implementations don't tend to be very large, so the DRY violation doesn't bother me. What's going to be a little tricky is that you can't have just one Similarity.makePostingDecoder() method. Sometimes you'll want a match-only decoder. Sometimes you'll want positions. Sometimes you'll want part-of-speech id. It's more of an interface/roles situation than a subclass situation. > > If you're looking for small steps, my suggestion would be to focus > > on per-field Similarity support. 
> > Well that alone isn't sufficient -- the index needs to record/provide > the raw stats, and doc boosting (norms array) needs to be done using > these stats. Not sufficient, but it's probably a prerequisite. Since it's a common feature request anyway, I think it's a great place to start: http://lucene.markmail.org/message/ln2xkesici6aksbi http://lucene.markmail.org/thread/46vxibpubogtcy3g http://lucene.markmail.org/message/56bk6wrbwallyjvr https://issues.apache.org/jira/browse/LUCENE-2236 Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
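The aggregation sketched earlier in this thread -- per-segment token counts rolled up into corpus-wide totals at searcher init -- can be expressed as a short Java sketch. The class and method names here are invented for illustration; this is neither Lucene's nor Lucy's API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the aggregation discussed in this thread: each segment stores
// the sum of token counts per field, and a corpus-wide average field
// length (the "avgdl" that BM25 needs) is derived at searcher init by
// summing the per-segment aggregates. All names are hypothetical.
public class FieldLengthStats {
    // field name -> total tokens across all added segments
    private final Map<String, Long> tokenTotals = new HashMap<>();
    private long docCount = 0;

    // Fold in one segment's stored aggregates (saving a full pass over
    // the segment's postings at searcher init).
    void addSegment(Map<String, Long> segmentTokenCounts, long segmentDocs) {
        for (Map.Entry<String, Long> e : segmentTokenCounts.entrySet()) {
            tokenTotals.merge(e.getKey(), e.getValue(), Long::sum);
        }
        docCount += segmentDocs;
    }

    // Corpus-wide average field length, e.g. for BM25 length normalization.
    double averageFieldLength(String field) {
        long total = tokenTotals.getOrDefault(field, 0L);
        return docCount == 0 ? 0.0 : (double) total / docCount;
    }
}
```

In a multi-node cluster, the same addSegment() folding could consume aggregates shipped from remote nodes; since the average tends to be stable over time, those updates can be infrequent.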
Re: Baby steps towards making Lucene's scoring more flexible...
an index segment: its fields, document count, and so on. The Segment object itself writes one file, segmeta.json; besides storing info needed by Segment itself, the "segmeta" file serves as a central repository for metadata generated by other index components -- relieving them of the burden of storing metadata themselves. As far as aggregates go, I think you want to be careful to avoid storing any kind of data that scales with segment size within a SegmentInfo. > * Change Similarity, to allow field-specific Similarity (I think we > have issue open for this already). I think, also, lengthNorm > (which is no longer invoked during indexing) would no longer be > used. Well, as you might suspect, I consider that one a gimme. KinoSearch supports per-field Similarity now. The insane loose typing of fields in Lucene is going to make it a little tricky to implement, though. I think you just have to exclude fields assigned to specific similarity implementations from your merge-anything-to-the-lowest-common-denominator policy and throw exceptions when there are conflicts rather than attempt to resolve them. > I think we'd make the class that computes norms from these per-doc > stats on IR open pluggable. Similarity is where we decode norms right now. In my opinion, it should be the Similarity object from which we specify per-field posting formats. See my reply to Robert in the BM25 thread: http://markmail.org/message/77rmrfmpatxd3p2e That way, custom scoring implementations can guarantee that they always have the posting information they need available to make their similarity judgments. Similarity also becomes a more generalized notion, with the TF/IDF-specific functionality moving into a subclass. Similarity implementation and posting format are so closely related that in my opinion, it makes sense to tie them. 
> And, someday we could make what stats are gathered/stored during indexing > pluggable but for starters I think we should simply support the field length > in tokens and avg tf per field. I would argue against making this your top priority, because I think adding half-baked features that require index-format changes is bad policy. If you're looking for small steps, my suggestion would be to focus on per-field Similarity support. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code
[ https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837988#action_12837988 ] Marvin Humphrey commented on LUCENE-2282: - > As the API is now marked @lucene.internal, and it'll only be very > expert usage, I'm not as concerned as Marvin is about the risks of > even exposing this. Um, the only possible concerns I could have had were regarding public exposure of this API. If it's marked as internal, it's an implementation detail. Whether or not the dot is included in internal-use-only constant strings isn't something I'm going to waste a lot of time thinking about. ;) So now, not only do I really, really not care whether this goes in, I have no qualms about it either. Having users like Shai who are willing to recompile and regenerate to take advantage of experimental features is a big boon, as it allows us to test drive features before declaring them stable. Designing optimal APIs without usability testing is difficult to impossible. > Expose IndexFileNames as public, and make use of its methods in the code > > > Key: LUCENE-2282 > URL: https://issues.apache.org/jira/browse/LUCENE-2282 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 3.1 > > Attachments: LUCENE-2282.patch, LUCENE-2282.patch, LUCENE-2282.patch > > > IndexFileNames is useful for applications that extend Lucene, an in > particular those who extend Directory or IndexWriter. It provides useful > constants and methods to query whether a certain file is a core Lucene file > or not. In addition, IndexFileNames should be used by Lucene's code to > generate segment file names, or query whether a certain file matches a > certain extension. > I'll post the patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code
[ https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837499#action_12837499 ] Marvin Humphrey commented on LUCENE-2282: - > Any application that extends IW, or provide its own Directory > implementation, and wants to reference Lucene's file extensions properly > (i.e. not by putting its code under o.a.l.index or hardcoding ".del" as its > deletions file) will benefit from making it public. > Forgot to tag IFN as @lucene.internal ? If the class is tagged as "internal", then external applications like the one you describe above aren't supposed to use it, right? But I don't really care about whether this goes into Lucene. Go ahead, make it fully public and omit the "internal" tag. Not my problem. :) The thing is, I really don't understand what kind of thing you want to do. Are you writing your own deletions files? I'm trying to understand because the only use cases I can think of for this aren't compatible with index pluggability, which is a high priority for Lucy. * Sweep "non-core-Lucene files" to "clean up" an index. * Gather up "core-Lucene files" for export. * Audit "core-Lucene files" to determine whether the index is in a valid state. * Differentiate between "core-Lucene" and "non-core-Lucene" files when writing a compound file. Maybe there's something I haven't thought of, though. Why do you want to "reference Lucene's file extensions properly"? Once you've identified which files are "core Lucene" and which files aren't, what are you going to do with them? 
> Expose IndexFileNames as public, and make use of its methods in the code > > > Key: LUCENE-2282 > URL: https://issues.apache.org/jira/browse/LUCENE-2282 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Shai Erera > Fix For: 3.1 > > Attachments: LUCENE-2282.patch, LUCENE-2282.patch > > > IndexFileNames is useful for applications that extend Lucene, an in > particular those who extend Directory or IndexWriter. It provides useful > constants and methods to query whether a certain file is a core Lucene file > or not. In addition, IndexFileNames should be used by Lucene's code to > generate segment file names, or query whether a certain file matches a > certain extension. > I'll post the patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code
[ https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837446#action_12837446 ] Marvin Humphrey commented on LUCENE-2282: - It seems to me that identifying only core index files conflicts with the idea of pluggable index formats. Presumably plugins would use their own file extensions. Would these belong to the index, according to a detector based off of IndexFileNames? Presumably not, which would either limit the usefulness of such a utility, or outright encourage anti-patterns such as a sweeper that zaps files created by plugins because they aren't "core Lucene" enough. Also, are temporary files "core Lucene"? Lockfiles? Only sometimes? What are the applications that we are trying to support by exposing this API? > Expose IndexFileNames as public, and make use of its methods in the code > > > Key: LUCENE-2282 > URL: https://issues.apache.org/jira/browse/LUCENE-2282 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Shai Erera > Fix For: 3.1 > > > IndexFileNames is useful for applications that extend Lucene, an in > particular those who extend Directory or IndexWriter. It provides useful > constants and methods to query whether a certain file is a core Lucene file > or not. In addition, IndexFileNames should be used by Lucene's code to > generate segment file names, or query whether a certain file matches a > certain extension. > I'll post the patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836197#action_12836197 ] Marvin Humphrey commented on LUCENE-2271: - An awful lot of thought went into optimizing those collection algorithms. I disagree with many of the design decisions that were made, but it seems rushed to blithely revert those optimizations. FWIW, the SortCollector in KS (designed on the Lucy list last spring, would be in Lucy but some prereqs haven't gone in yet) doesn't have the problem with -Inf sentinels. It uses an array of "actions" representing sort rules to determine whether a hit is "competitive" and should be inserted into the queue; the first action is set to AUTO_ACCEPT (meaning try inserting the hit into the queue) until the queue fills up, and then again to AUTO_ACCEPT at the start of each segment. It's not necessary to fill up the queue with dummy hits beforehand. {code:none} static INLINE bool_t SI_competitive(SortCollector *self, i32_t doc_id) { u8_t *const actions = self->actions; u32_t i = 0; /* Iterate through our array of actions, returning as quickly as * possible. */ do { switch (actions[i] & ACTIONS_MASK) { case AUTO_ACCEPT: return true; case AUTO_REJECT: return false; case AUTO_TIE: break; case COMPARE_BY_SCORE: { float score = Matcher_Score(self->matcher); if (score > self->bubble_score) { self->bumped->score = score; return true; } else if (score < self->bubble_score) { return false; } } break; case COMPARE_BY_SCORE_REV: { // ... case COMPARE_BY_DOC_ID: // ... case COMPARE_BY_ORD1: { // ... } } while (++i < self->num_actions); /* If we've made it this far and we're still tied, reject the doc so that * we prefer items already in the queue. This has the effect of * implicitly breaking ties by doc num, since docs are collected in order. */ return false; } {code} > Function queries producing scores of -inf or NaN (e.g. 
1/x) return incorrect > results with TopScoreDocCollector > -- > > Key: LUCENE-2271 > URL: https://issues.apache.org/jira/browse/LUCENE-2271 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.9 >Reporter: Uwe Schindler >Priority: Minor > Fix For: 2.9.2, 3.0.1, 3.1 > > Attachments: LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch, > LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch > > > This is a foolowup to LUCENE-2270, where a part of this problem was fixed > (boost = 0 leading to NaN scores, which is also un-intuitive), but in > general, function queries in Solr can create these invalid scores easily. In > previous version of Lucene these scores ordered correct (except NaN, which > mixes up results), but never invalid document ids are returned (like > Integer.MAX_VALUE). > The problem is: TopScoreDocCollector pre-fills the HitQueue with sentinel > ScoreDocs with a score of -inf and a doc id of Integer.MAX_VALUE. For the HQ > to work, this sentinel must be smaller than all posible values, which is not > the case: > - -inf is equal and the document is not inserted into the HQ, as not > competitive, but the HQ is not yet full, so the sentinel values keep in the > HQ and result is the Integer.MAX_VALUE docs. This problem is solveable (and > only affects the Ordered collector) by chaning the exit condition to: > {code} > if (score <= pqTop.score && pqTop.doc != Integer.MAX_VALUE) { > // Since docs are returned in-order (i.e., increasing doc Id), a document > // with equal score to pqTop.score cannot compete since HitQueue favors > // documents with lower doc Ids. Therefore reject those docs too. > return; > } > {code} > - The NaN case can be fixed in the same way, but then has another problem: > all comparisons with NaN result in false (none of these is true): x < NaN, x > > NaN, NaN == NaN. 
This leads to the fact that HQ's lessThan always returns > false, leading to unexpected ordering in the PQ and sometimes the sentinel > values do not stay at the top of the queue. A later
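The sentinel-free collection idea described in the comment above can be sketched in Java. This is a toy illustration, not KS's SortCollector or Lucene's TopScoreDocCollector: hits are auto-accepted until the queue fills, so no -Inf/Integer.MAX_VALUE dummy entries ever exist to leak into the results, and Float.compare()'s total ordering keeps -Inf scores from corrupting the queue.

```java
import java.util.PriorityQueue;

// Toy sentinel-free top-N collector. All names are invented for
// illustration.
public class TopHits {
    static class Hit {
        final int docId;
        final float score;
        Hit(int docId, float score) { this.docId = docId; this.score = score; }
    }

    private final int size;
    // Min-queue ordered by score: the weakest competitive hit sits on top.
    // Float.compare imposes a total order, so -Inf (and even NaN) scores
    // cannot break the heap invariants.
    private final PriorityQueue<Hit> queue =
        new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    TopHits(int size) { this.size = size; }

    void collect(int docId, float score) {
        if (queue.size() < size) {
            queue.add(new Hit(docId, score)); // "AUTO_ACCEPT" while filling
        } else if (Float.compare(score, queue.peek().score) > 0) {
            queue.poll(); // strictly better than the weakest: replace it
            queue.add(new Hit(docId, score));
        }
        // Ties are rejected, implicitly preferring lower doc ids since
        // docs are collected in order.
    }

    PriorityQueue<Hit> hits() { return queue; }
}
```

Because the queue never holds dummy entries, collecting fewer hits than `size` simply yields a smaller result set rather than Integer.MAX_VALUE doc ids.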
[jira] Commented: (LUCENE-1941) MinPayloadFunction returns 0 when only one payload is present
[ https://issues.apache.org/jira/browse/LUCENE-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832989#action_12832989 ] Marvin Humphrey commented on LUCENE-1941: - > off on "vacation" (scare quotes for Marvin) Have "fun"! > MinPayloadFunction returns 0 when only one payload is present > - > > Key: LUCENE-1941 > URL: https://issues.apache.org/jira/browse/LUCENE-1941 > Project: Lucene - Java > Issue Type: Bug > Components: Query/Scoring >Affects Versions: 2.9, 3.0 >Reporter: Erik Hatcher > Fix For: 2.9.2, 3.0.1, 3.1 > > Attachments: LUCENE-1941.patch, LUCENE-1941.patch > > > In some experiments with payload scoring through PayloadTermQuery, I'm seeing > 0 returned when using MinPayloadFunction. I believe there is a bug there. > No time at the moment to flesh out a unit test, but wanted to report it for > tracking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Having a default constructor in Analyzers
DM Smith: > Imagine that each index maintains a manifest of the toolchain for the index, > which includes the version of each part of the chain. Since the index is > created all at once, this probably is the same as the version of lucene. > When the user searches the index the manifest is consulted to recreate the > toolchain. >8 snip 8< > IIRC: This is something that Marvin has implemented in Lucy. Yes. QueryParser's constructor takes a Schema argument. Furthermore, Schema definitions are fully externalized and stored as JSON with the index itself. So you can do stuff like this: IndexReader reader = IndexReader.open("/path/to/index"); QueryParser qparser = new QueryParser(reader.getSchema()); We haven't got Version for our Analyzers yet, but it's planned. I'm following this discussion with interest to see how the deployment of Version plays out with the user base. However, Lucy's approach won't work for Lucene because Lucene allows you to have fields with the same name and completely different semantics. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2213) Small improvements to ArrayUtil.getNextSize
[ https://issues.apache.org/jira/browse/LUCENE-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801433#action_12801433 ] Marvin Humphrey commented on LUCENE-2213: - > if it starts getting used very often for very small arrays, the overhead > will start to matter I think in most cases usage will only occur after an inequality test, when it's known that reallocation will be occurring. In my experience, the overhead of allocation will tend to swamp this kind of calculation. {code} if (needed > capacity) { int amount = ArrayUtil.oversize(needed, RamUsageEstimator.NUM_BYTES_CHAR); buffer = new char[amount]; } {code} > Small improvements to ArrayUtil.getNextSize > --- > > Key: LUCENE-2213 > URL: https://issues.apache.org/jira/browse/LUCENE-2213 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2213.patch, LUCENE-2213.patch, LUCENE-2213.patch > > > Spinoff from java-dev thread "Dynamic array reallocation algorithms" started > on Jan 12, 2010. > Here's what I did: > * Keep the +3 for small sizes > * Added 2nd arg = number of bytes per element. > * Round up to 4 or 8 byte boundary (if it's 32 or 64 bit JRE respectively) > * Still grow by 1/8th > * If 0 is passed in, return 0 back > I also had to remove some asserts in tests that were checking the actual > values returned by this method -- I don't think we should test that (it's an > impl. detail). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2213) Small improvements to ArrayUtil.getNextSize
[ https://issues.apache.org/jira/browse/LUCENE-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801432#action_12801432 ] Marvin Humphrey commented on LUCENE-2213: - Seems like the one permutation of "over" "allocation" and "size" you've omitted is oversize(minimum, width). (It's a style thing, but I try to use get* for accessors and avoid it elsewhere.) > Small improvements to ArrayUtil.getNextSize > --- > > Key: LUCENE-2213 > URL: https://issues.apache.org/jira/browse/LUCENE-2213 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2213.patch, LUCENE-2213.patch, LUCENE-2213.patch > > > Spinoff from java-dev thread "Dynamic array reallocation algorithms" started > on Jan 12, 2010. > Here's what I did: > * Keep the +3 for small sizes > * Added 2nd arg = number of bytes per element. > * Round up to 4 or 8 byte boundary (if it's 32 or 64 bit JRE respectively) > * Still grow by 1/8th > * If 0 is passed in, return 0 back > I also had to remove some asserts in tests that were checking the actual > values returned by this method -- I don't think we should test that (it's an > impl. detail). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2213) Small improvements to ArrayUtil.getNextSize
[ https://issues.apache.org/jira/browse/LUCENE-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800828#action_12800828 ] Marvin Humphrey commented on LUCENE-2213: - Algorithm looks good. The addition of the mandatory second argument works well. Nice work. Looks like there's a typo in the currently unused constant "NUM_BYTES_DOUBLT". As for the tests... Testing that optimizations like these are working properly is a pain, so I understand why you zapped 'em. Sometimes inequality or proportional testing can work in these situations: {code} assertTrue(t.termBuffer().length > t.termLength()); {code} That assertion wouldn't always hold true for this object, because sometimes the term will fill the whole array. And in a perfect world, you'd want to test that each and every array growth happens as expected -- but that's not practical. Still, in my opinion, a fragile, imperfect test in this situation is OK. > Small improvements to ArrayUtil.getNextSize > --- > > Key: LUCENE-2213 > URL: https://issues.apache.org/jira/browse/LUCENE-2213 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2213.patch > > > Spinoff from java-dev thread "Dynamic array reallocation algorithms" started > on Jan 12, 2010. > Here's what I did: > * Keep the +3 for small sizes > * Added 2nd arg = number of bytes per element. > * Round up to 4 or 8 byte boundary (if it's 32 or 64 bit JRE respectively) > * Still grow by 1/8th > * If 0 is passed in, return 0 back > I also had to remove some asserts in tests that were checking the actual > values returned by this method -- I don't think we should test that (it's an > impl. detail). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Dynamic array reallocation algorithms
On Wed, Jan 13, 2010 at 11:46:50AM -0500, Michael McCandless wrote: > If forced to pick, in general, I tend to prefer burning CPU not RAM, > because the CPU is often a one-time burn, whereas RAM ties up storage > for indefinite amounts of time. With our dependence on indexes being RAM-resident for optimum performance, I'd also favor being conservative with RAM. > I think this function should still aim to handle the smallish values, > ie, we shouldn't require every caller to have to handle the small > values themselves. Callers that want to override the small cases can > still do so... The more "helpful" the behavior of getNextSize(), the harder it is to understand what happens when you partially override it. But I guess it's not that big a deal one way or the other. There aren't that many places in Lucene where you might call getNextSize(). There are more such places in Lucy because we have to roll our own string and array classes, and we need finer-grained control over what happens there -- so maybe that explains why I'm not excited about trying to cram all that logic into a shared routine. Putting more logic into getNextSize() would be less of a problem if Lucene's implementation was less convoluted. It's only one line and one comment, but it's deceptively difficult to grok. Looks like some Perl golfer wrote it. ;) Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Dynamic array reallocation algorithms
On Wed, Jan 13, 2010 at 09:43:12AM -0500, Yonik Seeley wrote: > Yeah, something highly optimized for python in C may not be optimal for Java. It looks like that algo was tuned to address poor reallocation performance under Windows 9x. http://svn.python.org/view/python/trunk/Objects/listobject.c?r1=19445&r2=20939 * This over-allocates proportional to the list size, making room * for additional growth. The over-allocation is mild, but is * enough to give linear-time amortized behavior over a long * sequence of appends() in the presence of a poorly-performing * system realloc() (which is a reality, e.g., across all flavors * of Windows, with Win9x behavior being particularly bad -- and * we've still got address space fragmentation problems on Win9x * even with this scheme, although it requires much longer lists to * provoke them than it used to). */ That "3" used to be a "1": http://svn.python.org/view/python/trunk/Objects/listobject.c?r1=35279&r2=35280 > Seems like the right thing is highly dependent on the use case. It seems that way to me. That's why I think it's better to have a simpler routine and to force more responsibility onto the client code. > In this case, the number of byte arrays temporarily being managed can > be maxDoc (in the very worst case) so it's critical not to waste any > space. Yes, we want to make sure it's possible to ask for a specific array size and get that exact array size. (I think this is a bigger problem in Lucy than in Lucene, because we have to simulate bounded arrays with classes). Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Dynamic array reallocation algorithms
On Wed, Jan 13, 2010 at 05:48:08AM -0500, Michael McCandless wrote: > Have you notified python-dev? No, not yet. Is it kosher with the Python license to have copied-and-pasted that comment? It's not credited from what I can see. Small, but we should probably fix that. > Right, and also to strike a balance with not wasting too much > over-allocated but not-yet-used RAM (hence 1/8 growth, not 1/2 or 1). I agree, a smaller size is better. Say you start with an array which holds 800 elements but which has been over-allocated by one eighth (+100 elements = 900). Reallocating at 900 and again at 1000-something isn't that different from reallocating only once at 1000. So long as you avoid reallocating at every increment -- 801, 802, etc -- you have achieved your main goal. Both mild and aggressive over-allocation solve that problem, but aggressive over-allocation also involves significant RAM costs. Where the best balance lies depends on how bad the reallocation performance is in relation to the cost of insertion. An eighth seems pretty good. Doubling seems too high. 1.5, dunno, seems like it could work. According to this superficially credible-seeming comment at stackoverflow, that's what Ruby does: http://stackoverflow.com/questions/1100311/what-is-the-ideal-growth-rate-for-a-dynamically-allocated-array/1100449#1100449 > But, you have to do something different when targetSize < 8, else that > formula doesn't grow. That's right, it was by design. For small sizes, things get tricky. If it's a byte array, you definitely want to round up to the size of a pointer, and as we enter the era of ubiquitous 64-bit processors, rounding up to the nearest multiple of 8 seems proper. But what about arrays of objects? Would we really want every two-element array to reserve space for 8 objects? 
The way I've got this handled in a forthcoming patch to Lucy is to trust the user about the size they want in most cases and count on them to add the logic for small sizes (as was done in TermBufferImpl.java). The Grow() methods of CharBuf, ByteBuf, and VArray obey the exact amount -- if you want an overallocation, you'll invoke the over-estimator on the argument you supply to Grow(). It's just the incremental appends like VArray's Push() or CharBuf's Cat_Char() that trigger the automatic over-allocation internally. > Also, I think for smallish sizes we want faster than 12.5% growth, > because the constant-time cost of the mechanics of doing a > reallocation are at that point relatively high (wrt the cost of > copying the bytes over to the new array). But is that important if, in most of the cases where an array grows incrementally, you've already overallocated manually? Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
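The trade-off being discussed -- mild 1/8th growth burns more reallocations (CPU) while aggressive doubling strands more over-allocated RAM -- can be illustrated with a toy simulation. The growth policies below are idealized, not ArrayUtil's actual behavior:

```java
// Count how many reallocations each growth policy needs to reach one
// million elements, starting from a capacity of 8. Idealized policies
// for illustration only.
public class GrowthTradeoff {
    static int reallocsToReach(int target, double factor) {
        int cap = 8;
        int reallocs = 0;
        while (cap < target) {
            cap = Math.max(cap + 1, (int) (cap * factor)); // always make progress
            reallocs++;
        }
        return reallocs;
    }

    public static void main(String[] args) {
        int mild = reallocsToReach(1_000_000, 1.125);   // ~100 reallocations
        int aggressive = reallocsToReach(1_000_000, 2.0); // 17 reallocations
        System.out.println("1/8th growth: " + mild + " reallocs");
        System.out.println("doubling:     " + aggressive + " reallocs");
        // Doubling burns far fewer reallocations (CPU) but can strand up
        // to half the final allocation as unused RAM.
    }
}
```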
Re: Dynamic array reallocation algorithms
On Tue, Jan 12, 2010 at 10:46:29PM -0500, DM Smith wrote: > So starting at 0, the size is 0. > 0 => 0 > 0 + 1 => 4 > 4 + 1 => 8 > 8 + 1 => 16 > 16 + 1 => 25 > 25 + 1 => 35 > ... > > So I think the copied python comment is correct but not obviously correct. So those numbers are supposed to be where the transitions occur? But that's not where the jumps are. The jumps happen at... 8, 16, 24, 32, 40, 48... ... as you'd expect when adding (input >> 3), which is after all just a more obscure way of writing (input / 8). Sorry for being dense, but I still don't get it. > The addition of 3 or 6 only helps initially, after some point it is just > noise. It has the characteristic of being less aggressive with subsequent > allocations. Well, I have my doubts about whether this actually helps or not. It doesn't seem general purpose enough. For an array of bytes, the desirable behavior is clear -- you really ought to round up to at least the size of a pointer because you're never going to return a non-aligned buffer. Feeding 10 into getNextSize() and getting back 17 is weird -- you should get back either 16 or 24. I also consider it strange that if you ask for 0 you get 3. A lot of the time, if you're asking for 0 it's because the resource may never need to be allocated. And if you know that the resource actually is going to be needed, you're going to write code like this, from TermAttributeImpl.java: termBuffer = new char[ArrayUtil.getNextSize(newSize < MIN_BUFFER_SIZE ? MIN_BUFFER_SIZE : newSize)]; (MIN_BUFFER_SIZE is 10.) But whatever. "3" and "6" look more significant on first read of the code than they actually are, but they're only strange, not detrimental to performance. I'm just ticked off because I spent so much time trying to understand code which turns out to do so little. > I'm not really up on whether this is best, but it is better than the > doubling algorithm that it replaced. 
I think your suggestion that such an > algorithm might contribute to fragmented memory is interesting. I wonder if > C, perl and java have different issues regarding that. It's wherever the memory allocator lives. That could be the JVM, it could be glibc, it could be your own custom allocator. If you compile Perl with -DUSE_MY_MALLOC it uses its own allocator, otherwise it uses the system's malloc. KinoSearch actually has a dedicated allocator it uses for a very targeted purpose, and this allocator has its own strategy for avoiding fragmentation. The golden mean issue is relevant to all of those allocators. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
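The claim about where the jumps fall can be checked mechanically. For inputs past the small-size cutoff of 9, the formula's output jumps by 2 exactly at multiples of 8, since that is where (input >> 3) increments; the formula below is copied from the thread, the class name is ours:

```java
// Locate where getNextSize()'s output jumps by more than 1, confirming
// that (past the small-size cutoff) the transitions fall on multiples
// of 8 -- not on the 0, 4, 8, 16, 25, 35... pattern the comment claims.
public class JumpPoints {
    static int nextSize(int targetSize) {
        return (targetSize >> 3) + (targetSize < 9 ? 3 : 6) + targetSize;
    }

    public static void main(String[] args) {
        StringBuilder jumps = new StringBuilder();
        for (int n = 10; n <= 48; n++) {
            if (nextSize(n) - nextSize(n - 1) > 1) {
                jumps.append(n).append(' ');
            }
        }
        System.out.println(jumps.toString().trim()); // prints "16 24 32 40 48"
    }
}
```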
Dynamic array reallocation algorithms
Greets, I've been trying to understand this comment regarding ArrayUtil.getNextSize(): * The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ... Maybe I'm missing something, but I can't see how the formula yields such a growth pattern: return (targetSize >> 3) + (targetSize < 9 ? 3 : 6) + targetSize; For input values of 9 or greater, all that formula does is multiply by 1.125 and add 6. (Values enumerated below my sig.) The comment appears to have originated with this Python commit: http://svn.python.org/view/python/trunk/Objects/listobject.c?r1=35279&r2=35280 I think it was wrong then, and it's wrong now. The primary purpose of getNextSize() is to minimize reallocations during dynamic array resizing by overallocating on certain actions. Exactly how much we overallocate by doesn't seem to matter that much. Python apparently adds an extra eighth or so. Ruby reportedly multiplies by 1.5. Theoretically, multipliers larger than the golden mean are supposed to be suboptimal because they tend to induce memory fragmentation: subsequent reallocations cannot reuse previously freed sections, because they never add up to the total required by the newly requested fragment. However, that assumes a reasonably closed memory usage pattern, and so long as the freed fragment can be reused by someone else somewhere, it won't go to waste. IMO, minimizing memory fragmentation is so dependent on the internal implementation of the system's memory allocator as to be not worth the trouble, but if we were to do it, I think the right approach is outlined in this comment documenting the intention of the Python resizing routine prior to the commit that introduced the new (broken?) algo: http://svn.python.org/view/python/trunk/Objects/listobject.c?revision=35125&view=markup /* Round up: * If n < 256, to a multiple of 8. * If n < 2048, to a multiple of 64. * If n < 16384, to a multiple of 512. * If n < 131072, to a multiple of 4096. * If n < 1048576, to a multiple of 32768. 
* If n < 8388608, to a multiple of 262144. * If n < 67108864, to a multiple of 2097152. * If n < 536870912, to a multiple of 16777216. * ... * If n < 2**(5+3*i), to a multiple of 2**(3*i). I can't really see the point of adding the small constant (6) for large values, as is done in the new algo. And if oversizing is important for small values (debatable, since there will always be lots of small memory chunks floating around in the allocation pool), then rounding up to 8 consistently makes more sense to me than the current behavior. IMO, just overallocating by some multiplier between 1.125 and 1.5 achieves our primary goal of avoiding pathological reallocation behavior, and that's enough. How about simplifying ArrayUtil.getNextSize() down to this? return targetSize + (targetSize / 8); Marvin Humphrey mar...@smokey:~ $ perl -le 'print "$_ => " . (($_ >> 3) + ($_ < 9 ? 3 : 6 ) + $_) for 0 .. 100' 0 => 3 1 => 4 2 => 5 3 => 6 4 => 7 5 => 8 6 => 9 7 => 10 8 => 12 9 => 16 10 => 17 11 => 18 12 => 19 13 => 20 14 => 21 15 => 22 16 => 24 17 => 25 18 => 26 19 => 27 20 => 28 21 => 29 22 => 30 23 => 31 24 => 33 25 => 34 26 => 35 27 => 36 28 => 37 29 => 38 30 => 39 31 => 40 32 => 42 33 => 43 34 => 44 35 => 45 36 => 46 37 => 47 38 => 48 39 => 49 40 => 51 41 => 52 42 => 53 43 => 54 44 => 55 45 => 56 46 => 57 47 => 58 48 => 60 49 => 61 50 => 62 51 => 63 52 => 64 53 => 65 54 => 66 55 => 67 56 => 69 57 => 70 58 => 71 59 => 72 60 => 73 61 => 74 62 => 75 63 => 76 64 => 78 65 => 79 66 => 80 67 => 81 68 => 82 69 => 83 70 => 84 71 => 85 72 => 87 73 => 88 74 => 89 75 => 90 76 => 91 77 => 92 78 => 93 79 => 94 80 => 96 81 => 97 82 => 98 83 => 99 84 => 100 85 => 101 86 => 102 87 => 103 88 => 105 89 => 106 90 => 107 91 => 108 92 => 109 93 => 110 94 => 111 95 => 112 96 => 114 97 => 115 98 => 116 99 => 117 100 => 118 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
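The formula and the proposed simplification are easy to check directly. This is a standalone sketch (the class name and `main` harness are mine, not Lucene's); `getNextSize` copies the formula quoted in the message above, and the spot checks match the enumeration below the sig.

```java
public class GrowthCheck {
    // The ArrayUtil.getNextSize() formula quoted in the message above.
    static int getNextSize(int targetSize) {
        return (targetSize >> 3) + (targetSize < 9 ? 3 : 6) + targetSize;
    }

    // The proposed simplification: overallocate by roughly 1.125x,
    // dropping the small additive constant.
    static int proposedNextSize(int targetSize) {
        return targetSize + (targetSize / 8);
    }

    public static void main(String[] args) {
        // For inputs >= 9 the current formula is just n + n/8 + 6,
        // matching the enumeration below the sig: 9 => 16, 100 => 118.
        System.out.println(getNextSize(9));        // 16
        System.out.println(getNextSize(100));      // 118
        // The simplification for the same input: 100 => 112.
        System.out.println(proposedNextSize(100)); // 112
    }
}
```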
Re: Compound File Default
On Tue, Jan 12, 2010 at 11:05:13AM -0500, Grant Ingersoll wrote: > At any rate, I feel pretty safe assuming no one is running a production > system on a MBP... I don't really care whether Lucene defaults to the compound file format or not (KS does, Lucy will, and that's good enough for me), but it seems weird that you're assuming that only MacBook Pros have that default. Just for giggles, I checked my old PowerPC Mac Mini, running 10.5 -- it's got a limit of 256. But beyond that, Lucene adopted the compound file format default for a reason, right? What's changed about the environment that justifies overturning that decision? When I checked a RedHat 9 box several years ago, it was at 1024; when I checked a CentOS 5.2 box today, it was at 1024. A FreeBSD 5.3 box several years ago was at 65536. File descriptor limits don't seem to advance like hardware stats. Go ahead and change the default, but I've got a feeling you're about to relearn old lessons. Marvin Humphrey
Re: Compound File Default
On Tue, Jan 12, 2010 at 09:49:09AM -0500, Grant Ingersoll wrote: > My Mac (non-laptop) reports: > ulimit -n > 2560 > > And I know I didn't change it. Before I posted, I had a few officemates corroborate. 4 people had 256 -- three on 10.6 and me on 10.5. I think these were all MacBook Pros. The exception was our DBA, who had high numbers (thousands) on both his MBP and his desktop. Marvin Humphrey
Re: Compound File Default
On Mon, Jan 11, 2010 at 03:20:17PM -0500, Grant Ingersoll wrote: > Should we really still be defaulting to true for setUseCompoundFile? Do > people still run out of file handles? Yep. You're going to smack up against that limit pretty quick on Mac OS X: mar...@smokey:~ $ ulimit -n 256 > If so, why not have them turn it on, instead of everyone else having to turn > it off. Can you up the file descriptor limit from within a running JVM? If not, you're setting yourself up with a non-portable default. Marvin Humphrey
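As far as I know, a stock JVM can read its file descriptor limit but has no API to raise it -- raising RLIMIT_NOFILE has to happen in the shell or service manager before the JVM starts -- which is the crux of the portability question above. A small sketch using the JDK's `com.sun.management` extension bean:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

public class FdLimits {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            // Read-only: the bean exposes no setter. A default that needs
            // more descriptors than this limit is non-portable.
            System.out.println("max fds:  " + unix.getMaxFileDescriptorCount());
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount());
        } else {
            System.out.println("descriptor counts not exposed on this platform");
        }
    }
}
```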
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794137#action_12794137 ] Marvin Humphrey commented on LUCENE-2026: - > we can't give hints to the OS to tell it not to cache certain reads/writes > (ie segment merging), For what it's worth, we haven't really solved that problem in Lucy either. The sliding window abstraction we wrapped around mmap/MapViewOfFile largely solved the problem of running out of address space on 32-bit operating systems. However, there's currently no way to invoke madvise through Lucy's IO abstraction layer -- it's a little tricky with compound files. Linux, at least, requires that the buffer supplied to madvise be page-aligned. So, say we're starting off on a posting list, and we want to communicate to the OS that it should treat the region we're about to read as MADV_SEQUENTIAL. If the start of the postings file is in the middle of a 4k page and the file right before it is a term dictionary, we don't want to indicate that that region should be treated as sequential. I'm not sure how to solve that problem without violating the encapsulation of the compound file model. Hmm, maybe we could store metadata about the virtual files indicating usage patterns (sequential, random, etc.), since files are generally part of dedicated data structures whose usage patterns are known at index time? Or maybe we just punt on that use case and worry only about segment merging. Hmm, wouldn't the act of deleting a file (and releasing all file descriptors) tell the OS that it's free to recycle any memory pages associated with it? > Actually why can't ord & offset be one, for the string sort cache? > Ie, if you write your string data in sort order, then the offsets are > also in sort order? (I think we may have discussed this already?) 
Right, we discussed this on lucy-dev last spring: http://markmail.org/message/epc56okapbgit5lw Incidentally, some of this thread replays our exchange at the top of LUCENE-1458 from a year ago. It was fun to go back and reread that: in the interim, we've implemented segment-centric search and memory-mapped field caches and term dictionaries, both of which were first discussed back then. :) Ords are great for low cardinality fields of all kinds, but become less efficient for high cardinality primitive numeric fields. For simplicity's sake, the prototype implementation of mmap'd field caches in KS always uses ords. > You don't want to have to create Lucy's equivalent of the JMM... The more I think about making Lucy classes thread safe, the harder it seems. :( I'd like to make it possible to share a Schema across threads, for instance, but that means all its Analyzers, etc. have to be thread-safe as well, which isn't practical when you start getting into contributed subclasses. Even if we succeed in getting Folders and FileHandles thread safe, it will be hard for the user to keep track of what they can and can't do across threads. "Don't share anything" is a lot easier to understand. We reap a big benefit by making Lucy's metaclass infrastructure thread-safe. Beyond that, seems like there's a lot of pain for little gain. > Refactoring of IndexWriter > -- > > Key: LUCENE-2026 > URL: https://issues.apache.org/jira/browse/LUCENE-2026 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > > I've been thinking for a while about refactoring the IndexWriter into > two main components. > One could be called a SegmentWriter and as the > name says its job would be to write one particular index segment. The > default one just as today will provide methods to add documents and > flushes when its buffer is full. > Other SegmentWriter implementations would do things like e.g. 
appending or > copying external segments [what addIndexes*() currently does]. > The second component's job would it be to manage writing the segments > file and merging/deleting segments. It would know about > DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would > provide hooks that allow users to manage external data structures and > keep them in sync with Lucene's data during segment merges. > API wise there are things we have to figure out, such as where the > updateDocument() method would fit in, because its deletion part > affects all segments, whereas the new document is only being added to > the new segment. > Of course these sh
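The page-alignment constraint raised in the comment above (madvise wanting page-aligned buffers, virtual files starting mid-page inside a compound file) can be handled with a bit of arithmetic: shrink the advice region to the largest page-aligned span lying entirely inside the virtual file, so the hint never spills onto a neighboring file. A sketch with hypothetical helper names, assuming 4 KiB pages for illustration:

```java
public class AdviceRegion {
    static final long PAGE = 4096; // assume 4 KiB pages for illustration

    // Round an offset up / down to a page boundary.
    static long alignUp(long off)   { return (off + PAGE - 1) & ~(PAGE - 1); }
    static long alignDown(long off) { return off & ~(PAGE - 1); }

    // Largest page-aligned span fully inside the virtual file [start, end).
    // Advice applied to this span cannot touch bytes of the adjacent files
    // packed into the same compound file. Returns null if the virtual file
    // doesn't contain even one full page.
    static long[] innerAligned(long start, long end) {
        long s = alignUp(start);
        long e = alignDown(end);
        return s < e ? new long[] { s, e } : null;
    }
}
```

The trade-off is that the first and last partial pages of the virtual file get no advice at all, which seems acceptable for the segment-merging use case.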
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793918#action_12793918 ] Marvin Humphrey commented on LUCENE-2026: - > Very interesting - thanks. So it also factors in how much the page > was used in the past, not just how long it's been since the page was > last used. In theory, I think that means the term dictionary will tend to be favored over the posting lists. In practice... hard to say, it would be difficult to test. :( > Even smallish indexes can see the pages swapped out? Yes, you're right -- the wait time to get at a small term dictionary isn't necessarily small. I've amended my previous post, thanks. > And of course Java pretty much forces threads-as-concurrency (JVM > startup time, hotspot compilation, are costly). Yes. Java does a lot of stuff that most operating systems can also do, but of course provides a coherent platform-independent interface. In Lucy we're going to try to go back to the OS for some of the stuff that Java likes to take over -- provided that we can develop a sane genericized interface using configuration probing and #ifdefs. It's nice that as long as the box is up our OS-as-JVM is always running, so we don't have to worry about its (quite lengthy) startup time. > Right, this is how Lucy would force warming. I think slurp-instead-of-mmap is orthogonal to warming, because we can warm file-backed RAM structures by forcing them into the IO cache, using either the cat-to-dev-null trick or something more sophisticated. The slurp-instead-of-mmap setting would cause warming as a side effect, but the main point would be to attempt to persuade the virtual memory system that certain data structures should have a higher status and not be paged out as quickly. > But, even within that CFS file, these three sub-files will not be > local? Ie you'll still have to hit three pages per "lookup" right? 
They'll be next to each other in the compound file because CompoundFileWriter orders them alphabetically. For big segments, though, you're right that they won't be right next to each other, and you could possibly incur as many as three page faults when retrieving a sort cache value. But what are the alternatives for variable width data like strings? You need the ords array anyway for efficient comparisons, so what's left are the offsets array and the character data. An array of String objects isn't going to have better locality than one solid block of memory dedicated to offsets and another solid block of memory dedicated to file data, and it's no fewer derefs even if the string object stores its character data inline -- more if it points to a separate allocation (like Lucy's CharBuf does, since it's mutable). For each sort cache value lookup, you're going to need to access two blocks of memory. * With the array of String objects, the first is the memory block dedicated to the array, and the second is the memory block dedicated to the String object itself, which contains the character data. * With the file-backed block sort cache, the first memory block is the offsets array, and the second is the character data array. I think the locality costs should be approximately the same... have I missed anything? > Write-once is good for Lucene too. Hellyeah. > And it seems like Lucy would not need anything crazy-os-specific wrt > threads? It depends on how many classes we want to make thread-safe, and it's not just the OS, it's the host. The bare minimum is simply to make Lucy thread-safe as a library. That's pretty close, because Lucy studiously avoided global variables whenever possible. The only problems that have to be addressed are the VTable_registry Hash, race conditions when creating new subclasses via dynamic VTable singletons, and refcounts on the VTable objects themselves. 
Once those issues are taken care of, you'll be able to use Lucy objects in separate threads with no problem, e.g. one Searcher per thread. However, if you want to *share* Lucy objects (other than VTables) across threads, all of a sudden we have to start thinking about "synchronized", "volatile", etc. Such constructs may not be efficient or even possible under some threading models. > Hmm I'd guess that field cache is slowish; deleted docs & norms are > very fast; terms index is somewhere in between. That jibes with my own experience. So maybe consider file-backed sort caches in Lucene, while keeping the status quo for everything else? > You're right, you'd get two readers for seg_12 in that case. By > "pool" I meant you're tapping into all the sub-readers that the > existing reader have opened - the rea
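The two-blocks-per-lookup argument in the message above can be made concrete. This is a heap-backed sketch in the spirit of the ords/offsets/character-data layout discussed in the thread -- not Lucy's or Lucene's actual code; in the real thing the three arrays would be file-backed (mmap'd) rather than allocated here:

```java
import java.nio.charset.StandardCharsets;

public class StringSortCacheSketch {
    private final int[] ords;      // doc id -> position in sort order
    private final int[] offsets;   // ord -> byte offset into charData
                                   // (length = numValues + 1, sentinel at end)
    private final byte[] charData; // concatenated UTF-8 values, in sort order

    public StringSortCacheSketch(int[] ords, int[] offsets, byte[] charData) {
        this.ords = ords;
        this.offsets = offsets;
        this.charData = charData;
    }

    // Comparisons touch only the ords array -- no string access at all.
    public int compare(int docA, int docB) {
        return Integer.compare(ords[docA], ords[docB]);
    }

    // A value lookup touches two blocks of memory: the offsets array,
    // then the character data -- the locality pattern discussed above.
    public String value(int docId) {
        int ord = ords[docId];
        return new String(charData, offsets[ord],
                offsets[ord + 1] - offsets[ord], StandardCharsets.UTF_8);
    }
}
```

For example, with values "apple" and "pear" and docs {0: "pear", 1: "apple", 2: "pear"}, the arrays would be ords = {1, 0, 1}, offsets = {0, 5, 9}, and charData = the UTF-8 bytes of "applepear".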
[jira] Issue Comment Edited: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793431#action_12793431 ] Marvin Humphrey edited comment on LUCENE-2026 at 12/23/09 3:54 AM: --- > I guess my confusion is what are all the other benefits of using > file-backed RAM? You can efficiently use process only concurrency > (though shared memory is technically an option for this too), and you > have wicked fast open times (but, you still must warm, just like > Lucene). Processes are Lucy's primary concurrency model. ("The OS is our JVM.") Making process-only concurrency efficient isn't optional -- it's a *core* *concern*. > What else? Oh maybe the ability to inform OS not to cache > eg the reads done when merging segments. That's one I sure wish > Lucene could use... Lightweight searchers mean architectural freedom. Create 2, 10, 100, 1000 Searchers without a second thought -- as many as you need for whatever app architecture you just dreamed up -- then destroy them just as effortlessly. Add another worker thread to your search server without having to consider the RAM requirements of a heavy searcher object. Create a command-line app to search a documentation index without worrying about daemonizing it. Etc. If your normal development pattern is a single monolithic Java process, then that freedom might not mean much to you. But with their low per-object RAM requirements and fast opens, lightweight searchers are easy to use within a lot of other development patterns. For example: lightweight searchers work well for maxing out multiple CPU cores under process-only concurrency. > In exchange you risk the OS making poor choices about what gets > swapped out (LRU policy is too simplistic... not all pages are created > equal), The Linux virtual memory system, at least, is not a pure LRU. 
It utilizes a page aging algo which prioritizes pages that have historically been accessed frequently even when they have not been accessed recently: {panel} http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html The default action when a page is first allocated, is to give it an initial age of 3. Each time it is touched (by the memory management subsystem) its age is increased by 3 to a maximum of 20. Each time the Kernel swap daemon runs it ages pages, decrementing their age by 1. {panel} And while that system may not be ideal from our standpoint, it's still pretty good. In general, the operating system's virtual memory scheme is going to work fine as designed, for us and everyone else, and minimize memory availability wait times. When will swapping out the term dictionary be a problem? * For indexes where queries are made frequently, no problem. * For systems with plenty of RAM, no problem. * For systems that aren't very busy, no problem. * -For small indexes, no problem.- The only situation we're talking about is infrequent queries against -large- indexes on busy boxes where RAM isn't abundant. Under those circumstances, it *might* be noticeable that Lucy's term dictionary gets paged out somewhat sooner than Lucene's. But in general, if the term dictionary gets paged out, so what? Nobody was using it. Maybe nobody will make another query against that index until next week. Maybe the OS made the right decision. OK, so there's a vulnerable bubble where the query rate against -a large index- an index is neither too fast nor too slow, on busy machines where RAM isn't abundant. I don't think that bubble ought to drive major architectural decisions. Let me turn your question on its head. What does Lucene gain in return for the slow index opens and large process memory footprint of its heavy searchers? > I do love how pure the file-backed RAM approach is, but I worry that > down the road it'll result in erratic search performance in certain > app profiles. 
If necessary, there's a straightforward remedy: slurp the relevant files into RAM at object construction rather than mmap them. The rest of the code won't know the difference between malloc'd RAM and mmap'd RAM. The slurped files won't take up any more space than the analogous Lucene data structures; more likely, they'll take up less. That's the kind of setting we'd hide away in the IndexManager class rather than expose as prominent API, and it would be a hint to index components rather than an edict. > Yeah, that you need 3 files for the string sort cache is a little > spooky... that's 3X the chance of a page fault. Not when using the compound format. > But the CFS construction must also go through the filesystem (like > Lucene) right? So you still incur IO load of creating the smal
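The "slurp instead of mmap" remedy described above is easy to sketch in NIO terms: both paths hand back a read-only ByteBuffer, so code downstream genuinely can't tell malloc'd RAM from mmap'd RAM. The class name and the boolean hint are hypothetical stand-ins for whatever the IndexManager setting would look like:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FileBacking {
    // Open a file either slurped into process RAM or memory-mapped.
    // Callers receive a read-only ByteBuffer either way and can't tell
    // the difference -- which is the point of the remedy above.
    public static ByteBuffer open(Path path, boolean slurp) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            if (slurp) {
                // Copy the whole file into process RAM up front, pinning it
                // against the VM system's paging decisions.
                ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
                while (buf.hasRemaining() && ch.read(buf) >= 0) { }
                buf.flip();
                return buf.asReadOnlyBuffer();
            }
            // Let the VM system page the file in and out on demand.
            // (The mapping stays valid after the channel is closed.)
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }
}
```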
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793431#action_12793431 ] Marvin Humphrey commented on LUCENE-2026: - > I guess my confusion is what are all the other benefits of using > file-backed RAM? You can efficiently use process only concurrency > (though shared memory is technically an option for this too), and you > have wicked fast open times (but, you still must warm, just like > Lucene). Processes are Lucy's primary concurrency model. ("The OS is our JVM.") Making process-only concurrency efficient isn't optional -- it's a *core* *concern*. > What else? Oh maybe the ability to inform OS not to cache > eg the reads done when merging segments. That's one I sure wish > Lucene could use... Lightweight searchers mean architectural freedom. Create 2, 10, 100, 1000 Searchers without a second thought -- as many as you need for whatever app architecture you just dreamed up -- then destroy them just as effortlessly. Add another worker thread to your search server without having to consider the RAM requirements of a heavy searcher object. Create a command-line app to search a documentation index without worrying about daemonizing it. Etc. If your normal development pattern is a single monolithic Java process, then that freedom might not mean much to you. But with their low per-object RAM requirements and fast opens, lightweight searchers are easy to use within a lot of other development patterns. For example: lightweight searchers work well for maxing out multiple CPU cores under process-only concurrency. > In exchange you risk the OS making poor choices about what gets > swapped out (LRU policy is too simplistic... not all pages are created > equal), The Linux virtual memory system, at least, is not a pure LRU. 
It utilizes a page aging algo which prioritizes pages that have historically been accessed frequently even when they have not been accessed recently: {panel} http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html The default action when a page is first allocated, is to give it an initial age of 3. Each time it is touched (by the memory management subsystem) its age is increased by 3 to a maximum of 20. Each time the Kernel swap daemon runs it ages pages, decrementing their age by 1. {panel} And while that system may not be ideal from our standpoint, it's still pretty good. In general, the operating system's virtual memory scheme is going to work fine as designed, for us and everyone else, and minimize memory availability wait times. When will swapping out the term dictionary be a problem? * For indexes where queries are made frequently, no problem. * For systems with plenty of RAM, no problem. * For systems that aren't very busy, no problem. * For small indexes, no problem. The only situation we're talking about is infrequent queries against large indexes on busy boxes where RAM isn't abundant. Under those circumstances, it *might* be noticeable that Lucy's term dictionary gets paged out somewhat sooner than Lucene's. But in general, if the term dictionary gets paged out, so what? Nobody was using it. Maybe nobody will make another query against that index until next week. Maybe the OS made the right decision. OK, so there's a vulnerable bubble where the query rate against a large index is neither too fast nor too slow, on busy machines where RAM isn't abundant. I don't think that bubble ought to drive major architectural decisions. Let me turn your question on its head. What does Lucene gain in return for the slow index opens and large process memory footprint of its heavy searchers? > I do love how pure the file-backed RAM approach is, but I worry that > down the road it'll result in erratic search performance in certain > app profiles. 
If necessary, there's a straightforward remedy: slurp the relevant files into RAM at object construction rather than mmap them. The rest of the code won't know the difference between malloc'd RAM and mmap'd RAM. The slurped files won't take up any more space than the analogous Lucene data structures; more likely, they'll take up less. That's the kind of setting we'd hide away in the IndexManager class rather than expose as prominent API, and it would be a hint to index components rather than an edict. > Yeah, that you need 3 files for the string sort cache is a little > spooky... that's 3X the chance of a page fault. Not when using the compound format. > But the CFS construction must also go through the filesystem (like > Lucene) right? So you still incur IO load of creating the small > files, then 2nd pass to consolidate. Yes. > I think we m
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792939#action_12792939 ] Marvin Humphrey commented on LUCENE-2026: - > But, that's where Lucy presumably takes a perf hit. Lucene can share > these in RAM, not using the filesystem as the intermediary (eg we do > that today with deletions; norms/field cache/eventual CSF can do the > same.) Lucy must go through the filesystem to share. For a flush(), I don't think there's a significant penalty. The only extra costs Lucy will pay are the bookkeeping costs to update the file system state and to create the objects that read the index data. Those are real, but since we're skipping the fsync(), they're small. As far as the actual data, I don't see that there's a difference. Reading from memory mapped RAM isn't any slower than reading from malloc'd RAM. If we have to fsync(), there'll be a cost, but in Lucene you have to pay that same cost, too. Lucene expects to get around it with IndexWriter.getReader(). In Lucy, we'll get around it by having you call flush() and then reopen a reader somewhere, often in another process. * In both cases, the availability of fresh data is decoupled from the fsync. * In both cases, the indexing process has to be careful about dropping data on the floor before a commit() succeeds. * In both cases, it's possible to protect against unbounded corruption by rolling back to the last commit. > Mostly I was thinking performance, ie, trusting the OS to make good > decisions about what should be RAM resident, when it has limited > information... Right, for instance because we generally can't force the OS to pin term dictionaries in RAM, as discussed a while back. It's not an ideal situation, but Lucene's approach isn't bulletproof either, since Lucene's term dictionaries can get paged out too. 
We're sure not going to throw away all the advantages of mmap and go back to reading data structures into process RAM just because of that. > But, also risky is that all important data structures must be "file-flat", > though in practice that doesn't seem like an issue so far? It's a constraint. For instance, to support mmap, string sort caches currently require three "files" each: ords, offsets, and UTF-8 character data. The compound file system makes the file proliferation bearable, though. And it's actually nice in a way to have data structures as named files, strongly separated from each other and persistent. If we were willing to ditch portability, we could cast to arrays of structs in Lucy -- but so far we've just used primitives. I'd like to keep it that way, since it would be nice if the core Lucy file format was at least theoretically compatible with a pure Java implementation. But Lucy plugins could break that rule and cast to structs if desired. > The RAM resident things Lucene has - norms, deleted docs, terms index, field > cache - seem to "cast" just fine to file-flat. There are often benefits to keeping stuff "file-flat", particularly when the file-flat form is compressed. If we were to expand those sort caches to string objects, they'd take up more RAM than they do now. I think the only significant drawback is security: we can't trust memory mapped data the way we can data which has been read into process RAM and checked on the way in. For instance, we need to perform UTF-8 sanity checking each time a string sort cache value escapes the controlled environment of the cache reader. If the sort cache value was instead derived from an existing string in process RAM, we wouldn't need to check it. > If we switched to an FST for the terms index I guess that could get > tricky... Hmm, I haven't been following that. 
Too much work to keep up with those giganto patches for flex indexing, even though it's a subject I'm intimately acquainted with and deeply interested in. I plan to look it over when you're done and see if we can simplify it. :) > Wouldn't shared memory be possible for process-only concurrent models? IPC is a platform-compatibility nightmare. By restricting ourselves to communicating via the file system, we save ourselves oodles of engineering time. And on really boring, frustrating work, to boot. > Also, what popular systems/environments have this requirement (only process > level concurrency) today? Perl's threads suck. Actually all threads suck. Perl's are just worse than average -- and so many Perl binaries are compiled without them. Java threads suck less, but they still suck -- look how much engineering time you folks blow on managing that stuff. Threads are a terrible p
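The UTF-8 sanity check mentioned above maps naturally onto a strict decoder: configure it to REPORT malformed input so corrupt memory-mapped bytes raise an error instead of silently escaping as replacement characters. A sketch (the class and method names are mine, not Lucy's):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Decode bytes that came from an untrusted (e.g. memory-mapped) source.
    // REPORT makes invalid UTF-8 throw rather than being replaced silently.
    public static String decodeChecked(ByteBuffer mapped)
            throws CharacterCodingException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // duplicate() so the caller's buffer position is left untouched.
        return dec.decode(mapped.duplicate()).toString();
    }
}
```

A fresh decoder per call sidesteps CharsetDecoder's lack of thread safety; a real implementation would likely pool one per thread instead.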
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792638#action_12792638 ] Marvin Humphrey commented on LUCENE-2026: - Yes, this is using the sort cache model worked out this spring on lucy-dev. The memory mapping happens within FSFileHandle (LUCY-83). SortWriter and SortReader haven't made it into the Lucy repository yet. > Refactoring of IndexWriter > -- > > Key: LUCENE-2026 > URL: https://issues.apache.org/jira/browse/LUCENE-2026 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > > I've been thinking for a while about refactoring the IndexWriter into > two main components. > One could be called a SegmentWriter and as the > name says its job would be to write one particular index segment. The > default one just as today will provide methods to add documents and > flushes when its buffer is full. > Other SegmentWriter implementations would do things like e.g. appending or > copying external segments [what addIndexes*() currently does]. > The second component's job would it be to manage writing the segments > file and merging/deleting segments. It would know about > DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would > provide hooks that allow users to manage external data structures and > keep them in sync with Lucene's data during segment merges. > API wise there are things we have to figure out, such as where the > updateDocument() method would fit in, because its deletion part > affects all segments, whereas the new document is only being added to > the new segment. > Of course these should be lower level APIs for things like parallel > indexing and related use cases. That's why we should still provide > easy to use APIs like today for people who don't need to care about > per-segment ops during indexing. 
So the current IndexWriter could > probably keep most of its APIs and delegate to the new classes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792625#action_12792625 ] Marvin Humphrey commented on LUCENE-2026: - > Well, autoCommit just means "periodically call commit". So, if you > decide to offer a commit() operation, then autoCommit would just wrap > that? But, I don't think autoCommit should be offered... app should > decide. Agreed, autoCommit had benefits under legacy Lucene, but wouldn't be important now. If we did add some sort of "automatic commit" feature, it would mean something else: commit every change instantly. But that's easy to implement via a wrapper, so there's no point cluttering the primary index writer class to support such a feature. > Again: NRT is not a "specialized reader". It's a normal read-only > DirectoryReader, just like you'd get from IndexReader.open, with the > only difference being that it consulted IW to find which segments to > open. Plus, it's pooled, so that if IW already has a given segment > reader open (say because deletes were applied or merges are running), > it's reused. Well, it seems to me that those two features make it special -- particularly the pooling of SegmentReaders. You can't take advantage of that outside the context of IndexWriter: > Yes, Lucene's approach must be in the same JVM. But we get important > gains from this - reusing a single reader (the pool), carrying over > merged deletions directly in RAM (and eventually field cache & norms > too - LUCENE-1785). Exactly. In my view, that's what makes that reader "special": unlike ordinary Lucene IndexReaders, this one springs into being with its caches already primed rather than in need of lazy loading. But to achieve those benefits, you have to mod the index writing process. Those modifications are not necessary under the Lucy model, because the mere act of writing the index stores our data in the system IO cache. 
> Instead, Lucy (by design) must do all sharing & access all index data > through the filesystem (a decision, I think, could be dangerous), > which will necessarily increase your reopen time. Dangerous in what sense? Going through the file system is a tradeoff, sure -- but it's pretty nice to design your low-latency search app free from any concern about whether indexing and search need to be coordinated within a single process. Furthermore, if separate processes are your primary concurrency model, going through the file system is actually mandatory to achieve best performance on a multi-core box. Lucy won't always be used with multi-threaded hosts. I actually think going through the file system is dangerous in a different sense: it puts pressure on the file format spec. The easy way to achieve IPC between writers and readers will be to dump stuff into one of the JSON files to support the killer-feature-du-jour -- such as what I'm proposing with this "fsync" key in the snapshot file. But then we wind up with a bunch of crap cluttering up our index metadata files. I'm determined that Lucy will have a more coherent file format than Lucene, but with this IPC requirement we're setting our community up to push us in the wrong direction. If we're not careful, we could end up with a file format that's an unmaintainable jumble. But you're talking performance, not complexity costs, right? > Maybe in practice that cost is small though... the OS write cache should > keep everything fresh... but you still must serialize. Anecdotally, at Eventful one of our indexes is 5 GB with 16 million records and 900 MB worth of sort cache data; opening a fresh searcher and loading all sort caches takes circa 21 ms. There's room to improve that further -- we haven't yet implemented IndexReader.reopen() -- but that was fast enough to achieve what we wanted to achieve. 
> Refactoring of IndexWriter > -- > > Key: LUCENE-2026 > URL: https://issues.apache.org/jira/browse/LUCENE-2026 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > > I've been thinking for a while about refactoring the IndexWriter into > two main components. > One could be called a SegmentWriter and as the > name says its job would be to write one particular index segment. The > default one just as today will provide methods to add documents and > flushes when its buffer is full. > Other SegmentWriter implementations would do things like e.g. appending
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791549#action_12791549 ] Marvin Humphrey commented on LUCENE-2026: - >> Wasn't that a possibility under autocommit as well? All it takes is for the >> OS to finish flushing the new snapshot file to persistent storage before it >> finishes flushing a segment data file needed by that snapshot, and for the >> power failure to squeeze in between. > > Not after LUCENE-1044... autoCommit simply called commit() at certain > opportune times (after finish big merges), which does the right thing (I > hope!). The segments file is not written until all files it references are > sync'd. FWIW, autoCommit doesn't really have a place in Lucy's one-segment-per-indexing-session model. Revisiting the LUCENE-1044 threads, one passage stood out: {panel} http://www.gossamer-threads.com/lists/lucene/java-dev/54321#54321 This is why in a db system, the only file that is sync'd is the log file - all other files can be made "in sync" from the log file - and this file is normally striped for optimum write performance. Some systems have special "log file drives" (some even solid state, or battery backed ram) to aid the performance. {panel} The fact that we have to sync all files instead of just one seems sub-optimal. Yet Lucene is not well set up to maintain a transaction log. The very act of adding a document to Lucene is inherently lossy even if all fields are stored, because doc boost is not preserved. > Also, having the app explicitly decouple these two notions keeps the > door open for future improvements. If we force absolutely all sharing > to go through the filesystem then that limits the improvements we can > make to NRT. However, Lucy has much more to gain going through the file system than Lucene does, because we don't necessarily incur JVM startup costs when launching a new process. 
The Lucene approach to NRT -- specialized reader hanging off of writer -- is constrained to a single process. The Lucy approach -- fast index opens enabled by mmap-friendly index formats -- is not.

The two approaches aren't mutually exclusive. It will be possible to augment Lucy with a specialized index reader within a single process. However, A) there seems to be a lot of disagreement about just how to integrate that reader, and B) there seem to be ways to bolt that functionality on top of the existing classes. Under those circumstances, I think it makes more sense to keep that feature external for now.

> Alternatively, you could keep the notion "flush" (an unsafe commit)
> alive? You write the segments file, but make no effort to ensure its
> durability (and also preserve the last "true" commit). Then a normal
> IR.reopen suffices...

That sounds promising. The semantics would differ from those of Lucene's flush(), which doesn't make changes visible.

We could implement this by somehow marking a "committed" snapshot and a "flushed" snapshot differently, either by adding an "fsync" property to the snapshot file that would be false after a flush() but true after a commit(), or by encoding the property within the snapshot filename. The file purger would have to ensure that all index files referenced by either the last committed snapshot or the last flushed snapshot were off limits. A rollback() would zap all changes since the last commit().

Such a scheme allows the top-level app to avoid the costs of fsync while maintaining its own transaction log -- perhaps with the optimizations suggested above (separate disk, SSD, etc).
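The purging rule described above -- everything referenced by either the last committed snapshot or the last flushed snapshot is off limits -- reduces to a set union. A minimal sketch (class and method names hypothetical, not actual Lucy code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical file purger: a file may be zapped only if NEITHER the
// last committed snapshot NOR the last flushed (unsynced) snapshot
// references it.
public class FilePurger {
    static Set<String> purgeable(Set<String> allFiles,
                                 Set<String> lastCommitted,
                                 Set<String> lastFlushed) {
        Set<String> protectedFiles = new HashSet<>(lastCommitted);
        protectedFiles.addAll(lastFlushed);       // union of both snapshots
        Set<String> candidates = new HashSet<>(allFiles);
        candidates.removeAll(protectedFiles);     // everything else is fair game
        return candidates;
    }

    public static void main(String[] args) {
        Set<String> all = new HashSet<>(Arrays.asList("seg_1", "seg_2", "seg_3"));
        Set<String> committed = new HashSet<>(Arrays.asList("seg_1"));
        Set<String> flushed = new HashSet<>(Arrays.asList("seg_1", "seg_3"));
        System.out.println(purgeable(all, committed, flushed)); // [seg_2]
    }
}
```

A rollback() would then amount to discarding the flushed snapshot and re-running the purge against the committed one alone.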
> Refactoring of IndexWriter > -- > > Key: LUCENE-2026 > URL: https://issues.apache.org/jira/browse/LUCENE-2026 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > > I've been thinking for a while about refactoring the IndexWriter into > two main components. > One could be called a SegmentWriter and as the > name says its job would be to write one particular index segment. The > default one just as today will provide methods to add documents and > flushes when its buffer is full. > Other SegmentWriter implementations would do things like e.g. appending or > copying external segments [what addIndexes*() currently does]. > The second component's job would it be to manage writing the segments > file and merging/deleting segments. It would know about > DeletionPolicy, MergePolicy and Me
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789905#action_12789905 ] Marvin Humphrey commented on LUCENE-2026:

> I think that's a poor default (trades safety for performance), unless
> Lucy eg uses a transaction log so you can concretely bound what's lost
> on crash/power loss. Or, if you go back to autocommitting I guess...

Search indexes should not be used for canonical data storage -- they should be built *on top of* canonical data storage. Guarding against power-failure-induced corruption in a database is an imperative. Guarding against power-failure-induced corruption in a search index is a feature, not an imperative.

Users have many options for dealing with the potential for such corruption. You can go back to your canonical data store and rebuild your index from scratch when it happens. In a search cluster environment, you can rsync a known-good copy from another node. Potentially, you might enable fsync-before-commit and keep your own transaction log. However, if the time it takes to rebuild or recover an index from scratch would have caused you unacceptable downtime, you can't possibly be operating in a single-point-of-failure environment where a power failure could take you down anyway -- so other recovery options are available to you.

Turning on fsync is only one step towards ensuring index integrity; other steps involve making decisions about hard drives, RAID arrays, failover strategies, network and off-site backups, etc., and are outside of our domain as library authors. We cannot meet the needs of users who need guaranteed index integrity on our own. For everybody else, what turning on fsync by default achieves is to make an exceedingly rare event rarer. That's valuable, but not essential.
My argument is that since the search indexes should not be used for canonical storage, and since fsync is not testably reliable and not sufficient on its own, it's a good engineering compromise to prioritize performance. > If we did this in Lucene, you can have unbounded corruption. It's not > just the last few minutes of updates... Wasn't that a possibility under autocommit as well? All it takes is for the OS to finish flushing the new snapshot file to persistent storage before it finishes flushing a segment data file needed by that snapshot, and for the power failure to squeeze in between. In practice, locality of reference is going to make the window very very small, since those two pieces of data will usually get written very close to each other on the persistent media. I've seen a lot more messages to our user lists over the years about data corruption caused by bugs and misconfigurations than by power failures. But really, that's as it should be. Ensuring data integrity to the degree required by a database is costly -- it requires far more rigorous testing, and far more conservative development practices. If we accept that our indexes must *never* go corrupt, it will retard innovation. Of course we should work very hard to prevent index corruption. However, I'm much more concerned about stuff like silent omission of search results due to overzealous, overly complex optimizations than I am about problems arising from power failures. When a power failure occurs, you know it -- so you get the opportunity to fsck the disk, run checkIndex(), perform data integrity reconciliation tests against canonical storage, and if anything fails, take whatever recovery actions you deem necessary. > You don't need to turn off sync for NRT - that's the whole point. It > gives you a reader without syncing the files. I suppose this is where Lucy and Lucene differ. Thanks to mmap and the near-instantaneous reader opens it has enabled, we don't need to keep a special reader alive. 
Since there's no special reader, the only way to get data to a search process is to go through a commit. But if we fsync on every commit, we'll drag down indexing responsiveness. Finishing the commit and returning control to client code as quickly as possible is a high priority for us.

Furthermore, I don't want us to have to write the code to support a near-real-time reader hanging off of IndexWriter a la Lucene. The architectural discussions have made for very interesting reading, but the design seems to be tricky to pull off, and implementation simplicity in core search code is a high priority for Lucy. It's better for Lucy to kill two birds with one stone and concentrate on making *all* index opens fast.

> Really, this is your safety tradeoff - it means you can commit less
> frequently, since the NRT reader can search the latest updates. But, your
> app has complete control over how it wants to trade safety for
> performance. So long
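The near-instantaneous opens referred to above are what memory mapping buys: mapping a file costs no up-front read I/O, and pages fault in lazily from the OS cache, so a reader over a warm file springs up almost for free. A toy illustration using Java's own mmap facility (Lucy itself does this in C via mmap(2); the file here is a stand-in for a sort-cache file):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapOpen {
    public static void main(String[] args) throws IOException {
        // Stand-in for an index data file already sitting in the OS cache.
        Path f = Files.createTempFile("sortcache", ".dat");
        Files.write(f, new byte[] {1, 2, 3, 4});

        try (FileChannel ch = FileChannel.open(f, StandardOpenOption.READ)) {
            // The map() call itself reads nothing; data is paged in on access.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println(buf.get(0) + buf.get(3)); // 5
        }
        Files.delete(f);
    }
}
```

With a format designed around such mappings, "opening" sort caches is mostly a matter of wiring pointers into already-cached pages rather than decoding and copying data.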
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789895#action_12789895 ] Marvin Humphrey commented on LUCENE-2126:

> I disagree with you here: introducing DataInput/Output makes IMO the API
> actually easier for the "normal" user to understand.
>
> I would think that most users don't implement IndexInput/Output extensions,
> but simply use the out-of-the-box Directory implementations, which provide
> IndexInput/Output impls. Also, most users probably don't even call the
> IndexInput/Output APIs directly.

I agree with everything you say in the second paragraph, but I don't see how any of that supports the assertion you make in the first paragraph.

Lucene's file system has a directory class, named "Directory", and a pair of classes which represent files, named "IndexInput" and "IndexOutput". Directories and files. Easy to understand.

All common IO systems have entities which represent data streaming to/from a file. They might be called "file handles", "file descriptors", "readers" and "writers", "streams", or whatever, but they're all basically the same thing. What this patch does is fragment the pair of classes that represent file IO... why?

What does a "normal" user do with a file?

Step 1: Open the file.
Step 2: Write data to the file.
Step 3: Close the file.

Then, later...

Step 1: Open the file.
Step 2: Read data from the file.
Step 3: Close the file.

You're saying that Lucene's file abstraction is easier to understand if you break that up?

I grokked your first rationale -- that you don't want people to be able to call close() on an IndexInput that they're essentially borrowing for a bit. OK, I think it's overkill to create an entire class to thwart something nobody was going to do anyway, but at least I understand why you might want to do that. But the idea that this strange fragmentation of the IO hierarchy makes things *easier* -- I don't get it at all.
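The two three-step lifecycles above, shown with the JDK's own stream classes standing in for Lucene's IndexOutput/IndexInput (this is plain java.io, not Lucene code) -- one entity per direction, opened, used, and closed:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileLifecycle {
    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("demo", ".bin");

        // Step 1: open. Step 2: write data. Step 3: close (via try-with-resources).
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(f))) {
            out.writeInt(42);
        }

        // Then, later... Step 1: open. Step 2: read data. Step 3: close.
        try (DataInputStream in = new DataInputStream(Files.newInputStream(f))) {
            System.out.println(in.readInt()); // 42
        }
        Files.delete(f);
    }
}
```

One writer-side class and one reader-side class cover the whole lifecycle; that is the model the comment argues for keeping intact.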
And I certainly don't see how it's such an improvement over what exists now that it justifies a change to the public API. > Split up IndexInput and IndexOutput into DataInput and DataOutput > - > > Key: LUCENE-2126 > URL: https://issues.apache.org/jira/browse/LUCENE-2126 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: Flex Branch >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Flex Branch > > Attachments: lucene-2126.patch > > > I'd like to introduce the two new classes DataInput and DataOutput > that contain all methods from IndexInput and IndexOutput that actually > decode or encode data, such as readByte()/writeByte(), > readVInt()/writeVInt(). > Methods like getFilePointer(), seek(), close(), etc., which are not > related to data encoding, but to files as input/output source stay in > IndexInput/IndexOutput. > This patch also changes ByteSliceReader/ByteSliceWriter to extend > DataInput/DataOutput. Previously ByteSliceReader implemented the > methods that stay in IndexInput by throwing RuntimeExceptions. > See also LUCENE-2125. > All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789614#action_12789614 ] Marvin Humphrey commented on LUCENE-2026: - > I say it's better to sacrifice write guarantee. I don't grok why sync is the default, especially given how sketchy hardware drivers are about obeying fsync: {panel} But, beware: some hardware devices may in fact cache writes even during fsync, and return before the bits are actually on stable storage, to give the appearance of faster performance. {panel} IMO, it should have been an option which defaults to false, to be enabled only by users who have the expertise to ensure that fsync() is actually doing what it advertises. But what's done is done (and Lucy will probably just do something different.) With regard to Lucene NRT, though, turning sync() off would really help. If and when some sort of settings class comes about, an enableSync(boolean enabled) method seems like it would come in handy. > Refactoring of IndexWriter > -- > > Key: LUCENE-2026 > URL: https://issues.apache.org/jira/browse/LUCENE-2026 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > > I've been thinking for a while about refactoring the IndexWriter into > two main components. > One could be called a SegmentWriter and as the > name says its job would be to write one particular index segment. The > default one just as today will provide methods to add documents and > flushes when its buffer is full. > Other SegmentWriter implementations would do things like e.g. appending or > copying external segments [what addIndexes*() currently does]. > The second component's job would it be to manage writing the segments > file and merging/deleting segments. It would know about > DeletionPolicy, MergePolicy and MergeScheduler. 
Ideally it would > provide hooks that allow users to manage external data structures and > keep them in sync with Lucene's data during segment merges. > API wise there are things we have to figure out, such as where the > updateDocument() method would fit in, because its deletion part > affects all segments, whereas the new document is only being added to > the new segment. > Of course these should be lower level APIs for things like parallel > indexing and related use cases. That's why we should still provide > easy to use APIs like today for people who don't need to care about > per-segment ops during indexing. So the current IndexWriter could > probably keeps most of its APIs and delegate to the new classes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788098#action_12788098 ] Marvin Humphrey commented on LUCENE-2126:

> These methods should only be able to call the read/write methods (which this
> issue moves to DataInput/Output), but not methods like close(), seek() etc..

Ah, so that's what it is. In that case, let me vote my (non-binding) -1. I don't believe that the enforcement of such a restriction justifies the complexity cost of adding a new class to the public API.

First, adding yet another class to the hierarchy steepens the learning curve for users and contributors. If you aren't in the rarefied echelon of exceptional brilliance occupied by people named Michael who work for IBM :), the gradual accumulation of complexity in the Lucene code base matters. Inch by inch, things move out of reach.

Second, changing things now for what seems to me like a minor reason makes it harder to refactor the class hierarchy in the future when other, more important reasons are inevitably discovered.

For LUCENE-2125, I recommend two possible options.

* Do nothing and assume that the sort of advanced user who writes a posting codec won't do something incredibly stupid like call indexInput.close().
* Add a note to the docs for writing posting codecs indicating which sorts of IO methods you ought not to call.

> once we see a need to allow users to extend DataInput/Output outside of
> Lucene we can go ahead and make the additional changes that are mentioned in
> your and in my comments here.

In Lucy, there are three tiers of IO usage:

* For low-level IO, use FileHandle.
* For most applications, use InStream's encoder/decoder methods.
* For performance-critical inner-loop material (e.g.
posting decoders, SortCollector), access the raw memory-mapped IO buffer using InStream_Buf()/InStream_Advance_Buf() and use static inline functions such as NumUtil_decode_c32 (which does no bounds checking) from Lucy::Util::NumberUtils. While you can extend InStream to add a codec, that's not generally the best way to go about it, because adding a method to InStream requires that all of your users both use your InStream class and use a subclassed Folder which overrides the Folder_Open_In() factory method (analogous to Directory.openInput()). Better is to use the extension point provided by InStream_Buf()/InStream_Advance_Buf() and write a utility function which accepts an InStream as an argument. I don't expect and am not advocating that Lucene adopt the same IO hierarchy as Lucy, but I wanted to provide an example of other reasons why you might change things. (What I'd really like to see is for Lucene to come up with something *better* than the Lucy IO hierarchy.) One of the reasons Lucene has so many backwards compatibility headaches is because the public APIs are so extensive and thus constitute such an elaborate set of backwards compatibility promises. IMO, DataInput and DataOutput do not offer sufficient benefit to compensate for the increased intricacy they add to that backwards compatibility contract. > Split up IndexInput and IndexOutput into DataInput and DataOutput > - > > Key: LUCENE-2126 > URL: https://issues.apache.org/jira/browse/LUCENE-2126 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: Flex Branch >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Flex Branch > > Attachments: lucene-2126.patch > > > I'd like to introduce the two new classes DataInput and DataOutput > that contain all methods from IndexInput and IndexOutput that actually > decode or encode data, such as readByte()/writeByte(), > readVInt()/writeVInt(). 
> Methods like getFilePointer(), seek(), close(), etc., which are not > related to data encoding, but to files as input/output source stay in > IndexInput/IndexOutput. > This patch also changes ByteSliceReader/ByteSliceWriter to extend > DataInput/DataOutput. Previously ByteSliceReader implemented the > methods that stay in IndexInput by throwing RuntimeExceptions. > See also LUCENE-2125. > All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787876#action_12787876 ] Marvin Humphrey commented on LUCENE-2126: - I spent a long time today trying to understand why DataInput and DataOutput are justified so that I could formulate an intelligent reply, but I had to give up. :\ Please carry on. > Split up IndexInput and IndexOutput into DataInput and DataOutput > - > > Key: LUCENE-2126 > URL: https://issues.apache.org/jira/browse/LUCENE-2126 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: Flex Branch >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Flex Branch > > Attachments: lucene-2126.patch > > > I'd like to introduce the two new classes DataInput and DataOutput > that contain all methods from IndexInput and IndexOutput that actually > decode or encode data, such as readByte()/writeByte(), > readVInt()/writeVInt(). > Methods like getFilePointer(), seek(), close(), etc., which are not > related to data encoding, but to files as input/output source stay in > IndexInput/IndexOutput. > This patch also changes ByteSliceReader/ByteSliceWriter to extend > DataInput/DataOutput. Previously ByteSliceReader implemented the > methods that stay in IndexInput by throwing RuntimeExceptions. > See also LUCENE-2125. > All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786959#action_12786959 ] Marvin Humphrey commented on LUCENE-2126: - FWIW, this approach is sort of the inverse of where we've gone with Lucy. In Lucy, low-level unbuffered IO operations are abstracted into FileHandle, which is either a thin wrapper around a POSIX file descriptor (e.g. FSFileHandle under unixen), or a simulation thereof (e.g. FSFileHandle under Windows, RAMFileHandle). Then there are InStream and OutStream, which would be analogous to DataInput and DataOutput, in that they have all the Lucy-specific encoding/decoding methods. However, instead of requiring that subclasses implement the low-level IO operations, InStream "has a" FileHandle and OutStream "has a" FileHandle. The advantage of breaking out FileHandle as a separate class is that if e.g. you extend InStream by adding on PFOR encoding, you automatically get the benefit for all IO implementations. I think that under the DataInput/DataOutput model, that extension technique will only be available to core devs of Lucene, no? More info: * LUCY-58 FileHandle * LUCY-63 InStream and OutStream > Split up IndexInput and IndexOutput into DataInput and DataOutput > - > > Key: LUCENE-2126 > URL: https://issues.apache.org/jira/browse/LUCENE-2126 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: Flex Branch >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Flex Branch > > Attachments: lucene-2126.patch > > > I'd like to introduce the two new classes DataInput and DataOutput > that contain all methods from IndexInput and IndexOutput that actually > decode or encode data, such as readByte()/writeByte(), > readVInt()/writeVInt(). > Methods like getFilePointer(), seek(), close(), etc., which are not > related to data encoding, but to files as input/output source stay in > IndexInput/IndexOutput. 
> This patch also changes ByteSliceReader/ByteSliceWriter to extend > DataInput/DataOutput. Previously ByteSliceReader implemented the > methods that stay in IndexInput by throwing RuntimeExceptions. > See also LUCENE-2125. > All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
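The "has a" FileHandle arrangement described in the comment above can be transliterated to Java (all names here are hypothetical stand-ins -- Lucy's actual InStream and FileHandle are C): because InStream wraps a FileHandle rather than extending it, a decoder added to InStream automatically works against every FileHandle implementation.

```java
// Hypothetical low-level unbuffered IO abstraction, standing in for
// Lucy's FileHandle.
interface FileHandle {
    int read(); // next byte as an unsigned value
}

// One of many possible impls (file-backed, RAM-backed, ...).
class RamFileHandle implements FileHandle {
    private final byte[] data;
    private int pos;
    RamFileHandle(byte[] data) { this.data = data; }
    public int read() { return data[pos++] & 0xFF; }
}

public class InStream {
    private final FileHandle handle; // composition, not inheritance

    public InStream(FileHandle handle) { this.handle = handle; }

    // A decoding method added here is available for ALL FileHandle
    // impls for free. This one decodes Lucene/Lucy-style VInts:
    // 7 data bits per byte, high bit set means "more bytes follow".
    public int readVInt() {
        int b = handle.read();
        int value = b & 0x7F;
        int shift = 7;
        while ((b & 0x80) != 0) {
            b = handle.read();
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        // 300 encoded as a VInt is the byte pair 0xAC, 0x02.
        InStream in = new InStream(new RamFileHandle(new byte[]{(byte) 0xAC, 0x02}));
        System.out.println(in.readVInt()); // 300
    }
}
```

Under the subclass-the-stream model, by contrast, the same codec would have to be re-implemented (or re-wired) once per IO implementation.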
Re: [jira] Resolved: (LUCENE-2119) If you pass Integer.MAX_VALUE as 2nd param to search(Query, int) you hit unexpected NegativeArraySizeException
On Sun, Dec 06, 2009 at 05:31:53PM -0500, Erick Erickson wrote:

> This may be a silly question, and I admit that I haven't looked at the code,
> but was there a good reason to +1 it in the first place or was that just
> paranoia to prevent off-by-one errors?

IIRC, this implementation of the priority queue algo leaves open slot 0 to simplify internal calculations. It was that way when I ported 1.4.3, and I doubt that's changed.

> If there *was* a valid reason, might it make sense to
> +1 min(nDocs, maxDoc())?

I think the patch is fine. It's really only needed to provide a more accurate error message in the event somebody specifies that they want Integer.MAX_VALUE elements, not realizing that they will be allocated up front rather than lazily -- they'll get an OOME rather than a NegativeArraySizeException.

Cheers,

Marvin Humphrey

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
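The failure mode under discussion is plain integer overflow: the queue allocates nDocs + 1 slots (slot 0 left open for the heap arithmetic), and Integer.MAX_VALUE + 1 wraps around to a negative array size. A minimal demonstration:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int nDocs = Integer.MAX_VALUE;
        try {
            // nDocs + 1 silently wraps to Integer.MIN_VALUE, a negative size.
            Object[] heap = new Object[nDocs + 1];
            System.out.println(heap.length); // never reached
        } catch (NegativeArraySizeException e) {
            System.out.println("NegativeArraySizeException: " + (nDocs + 1));
        }
    }
}
```

The exception is thrown before any memory is allocated, which is why callers saw a confusing NegativeArraySizeException instead of the OOME they would otherwise hit.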
[jira] Commented: (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
[ https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781531#action_12781531 ] Marvin Humphrey commented on LUCENE-1877: - >> take it somewhere other than this closed issue. > > Yes, where? The java-dev list: http://markmail.org/message/ivdgmxrivs3jzhfe > Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open) > -- > > Key: LUCENE-1877 > URL: https://issues.apache.org/jira/browse/LUCENE-1877 > Project: Lucene - Java > Issue Type: Improvement > Components: Javadocs >Reporter: Mark Miller >Assignee: Uwe Schindler > Fix For: 2.9 > > Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch, > LUCENE-1877.patch > > > A user requested we add a note in IndexWriter alerting the availability of > NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm > exit). Seems reasonable to me - we want users to be able to easily stumble > upon this class. The below code looks like a good spot to add a note - could > also improve whats there a bit - opening an IndexWriter does not necessarily > create a lock file - that would depend on the LockFactory used. > {code} Opening an IndexWriter creates a lock file for the > directory in use. Trying to open > another IndexWriter on the same directory will lead to a > {...@link LockObtainFailedException}. The {...@link > LockObtainFailedException} > is also thrown if an IndexReader on the same directory is used to delete > documents > from the index.{code} > Anyone remember why NativeFSLockFactory is not the default over > SimpleFSLockFactory? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Socket and file locks
On Sun, Nov 22, 2009 at 10:36:57AM +, Thomas Mueller (JIRA) wrote: > Thomas Mueller commented on LUCENE-1877: > > > > take it somewhere other than this closed issue. > > Yes, where? The java-dev list. > > shouldn't active code like that live in the application layer? > > Why? You can all but guarantee that polling will work at the app layer, because you can have almost full control over process priority. If the polling code is lower down and hidden away, then it worries me that a lock might be swept away by another process, and by the time the original process realizes that it doesn't hold the lock anymore, the damage could already have been done. Unless I'm missing something, it doesn't seem like a failsafe design. But this is "theoretical", I suppose: > I'm just trying to say that in theory, the thread is problematic, but in > practice it isn't. While file locking is not a problem in theory, but in > practice. Heh. :) > > What happens when the app sleeps? > > Good question! Standby / hibernate are not supported. I didn't think about > that. Is there a way to detect the wakeup? Not sure. FYI, I'm only an indirect contributor to Java Lucene. My main projects are Lucy and KinoSearch, loose ports to C. I know the problem domain intimately, but my Java skills are sketchy. > > host name and the pid > > Yes. It is not so easy to get the PID in Java, I found: > http://stackoverflow.com/questions/35842/process-id-in-java > "ManagementFactory.getRuntimeMXBean().getName()". A web search for "java process id" turns up a bazillion hits about how to hack up a PID. How annoying. This seems to me like a case of the perfect being the enemy of the good. How many machines that run Java are running operating systems that have no support for PIDs? Hasn't somebody open sourced a "GiveMeTheFrikkinPID" library yet? > What do you do if the lock was generated by another machine? Require that all machines participating in the writer pool supply a unique host ID as part of the locking API. 
Store that host ID in the lockfile and only allow machines to sweep stale files that they own. Unfortunately, that's not failsafe either, though: misconfiguration leads to index corruption rather than deadlock, when two machines that use identical host IDs sweep each other's lockfiles and write simultaneously. > I tried with using a server socket, so you need the IP address, but > unfortunately, sometimes the network is not configured correctly (but maybe > it's possible to detect that). Maybe the two machines can't access each > other over TCP/IP. This is an intriguing approach. Can it be designed to be failsafe? If the server and the client can't access each other, that's failsafe at least, because the client will simply fail to acquire the lock. But if a client is misconfigured, could it contact the wrong host, successfully open a port that coincidentally happens to be open, believe it has acquired the lock and corrupt the index? If so, could some sort of handshake prevent that? I'm also curious if we can use this approach for read locking. For that, you need a reference counting scheme -- one ref for each reader accessing the index. Is that possible under the socket model? > > hard links > > Yes, but it looks like this doesn't work always. It is theoretically possible for the link() call to return false incorrectly when the hard link has actually been created, for instance because a network problem prevents the "success" packet from getting back to the client from the server. However, this is failsafe, because the requestor will not believe that the lock has been secured and thus won't write. That process won't be able to sweep away the orphaned lock file itself, but once it exits, a graceful recovery will occur. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
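For what it's worth, the MXBean hack discussed above fits in a few lines. This is only a sketch; the "pid@hostname" shape of the MXBean name is a de facto convention of mainstream JVMs, not a documented guarantee, and the class name here is invented:

```java
import java.lang.management.ManagementFactory;

class PidSniffer {
    // The runtime MXBean name is conventionally "pid@hostname" on
    // mainstream JVMs; parse the leading integer out of it.
    static long pidFromMxBeanName(String name) {
        return Long.parseLong(name.split("@")[0]);
    }

    public static void main(String[] args) {
        String name = ManagementFactory.getRuntimeMXBean().getName();
        System.out.println("pid = " + pidFromMxBeanName(name));
    }
}
```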
[jira] Commented: (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
[ https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780647#action_12780647 ] Marvin Humphrey commented on LUCENE-1877: - > http://www.h2database.com/html/advanced.html#file_locking_protocols I'm a little concerned about the suitability of the polling approach for a low-level library like Lucene -- shouldn't active code like that live in the application layer? Is it possible to exceed the polling interval for a low priority process on a system under heavy load? What happens when the app sleeps? For removing stale lock files, another technique is to incorporate the host name and the pid. So long as you can determine that the lock file belongs to your machine and that the PID is not active, you can safely zap it. The tricky bit is how you get that information into the lock file. If you try to write that info to the lock file itself after an O_EXCL open, creating a fully valid lock file is no longer an atomic operation. The approach suggested by the creat(2) man page and endorsed in the Linux NFS FAQ involves hard links: {noformat} The solution for performing atomic file locking using a lockfile is to create a unique file on the same file system (e.g., incorporating hostname and pid), use link(2) to make a link to the lockfile. If link() returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check if its link count has increased to 2, in which case the lock is also successful. {noformat} This approach should also work on Windows for NTFS file systems since Windows 2000 thanks to the CreateHardLink() function. (Samba file shares, you're out of luck.) However, I'm not sure about the state of support for hard links in Java. If you're interested in continuing this discussion, we should probably take it somewhere other than this closed issue.
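Java 7's java.nio.file does expose hard links, for what it's worth. A minimal sketch of the recipe quoted above (class and method names are mine; the stat(2) link-count fallback for lossy NFS replies is omitted):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

class LinkLock {
    // Atomic lock acquisition per the creat(2)/NFS-FAQ recipe: write a
    // uniquely named file (e.g. "hostname:pid"), then hard-link it to the
    // shared lock name. Only the link() step races; it either succeeds
    // or throws.
    static boolean acquire(Path unique, Path lockFile, String hostAndPid)
            throws IOException {
        Files.write(unique, hostAndPid.getBytes(StandardCharsets.UTF_8));
        try {
            Files.createLink(lockFile, unique);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // someone else holds the lock
        }
    }
}
```

This assumes a filesystem that supports hard links (so not FAT or some network shares).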
> Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open) > -- > > Key: LUCENE-1877 > URL: https://issues.apache.org/jira/browse/LUCENE-1877 > Project: Lucene - Java > Issue Type: Improvement > Components: Javadocs >Reporter: Mark Miller >Assignee: Uwe Schindler > Fix For: 2.9 > > Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch, > LUCENE-1877.patch > > > A user requested we add a note in IndexWriter alerting the availability of > NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm > exit). Seems reasonable to me - we want users to be able to easily stumble > upon this class. The below code looks like a good spot to add a note - could > also improve whats there a bit - opening an IndexWriter does not necessarily > create a lock file - that would depend on the LockFactory used. > {code} Opening an IndexWriter creates a lock file for the > directory in use. Trying to open > another IndexWriter on the same directory will lead to a > {@link LockObtainFailedException}. The {@link > LockObtainFailedException} > is also thrown if an IndexReader on the same directory is used to delete > documents > from the index.{code} > Anyone remember why NativeFSLockFactory is not the default over > SimpleFSLockFactory? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2073) Document issues involved in building your index with one jdk version and then searching/updating with another
[ https://issues.apache.org/jira/browse/LUCENE-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779015#action_12779015 ] Marvin Humphrey commented on LUCENE-2073: - > I am pretty sure StandardAnalyzer is ok actually now. Good news! Thanks for performing that analysis. > Document issues involved in building your index with one jdk version and then > searching/updating with another > - > > Key: LUCENE-2073 > URL: https://issues.apache.org/jira/browse/LUCENE-2073 > Project: Lucene - Java > Issue Type: Task >Reporter: Mark Miller > Attachments: LUCENE-2073.patch, LUCENE-2073.patch > > > I think this needs to go in something of a permenant spot - this isn't a one > time release type issues - its going to present over multiple release. > {quote} > If there is nothing we can do here, then we just have to do the best we can - > such as a prominent notice alerting that if you transition JVM's between > building and searching the index and you are using or doing X, things will > break. > We should put this in a spot that is always pretty visible - perhaps even a > new readme file titlted something like IndexBackwardCompatibility or > something, to which we can add other tips and gotchyas as they come up. Or > MaintainingIndicesAcrossVersions, or > FancyWhateverGetsYourAttentionAboutUpgradingStuff. Or a permanent > entry/sticky entry at the top of Changes. > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2073) Document issues involved in building your index with one jdk version and then searching/updating with another
[ https://issues.apache.org/jira/browse/LUCENE-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779006#action_12779006 ] Marvin Humphrey commented on LUCENE-2073: - > are you sure? StandardAnalyzer uses LowerCaseFilter, No, I'm not sure. :( I was confusing StandardAnalyzer and StandardTokenizer. I still think that there are a lot of people who don't need to reindex, because, for example, their entire corpus is limited to Latin-1 code points. Conversely, the people most likely to be affected are the people most likely to be on the lookout for this kind of thing. I think it's important to reach this group, without unduly alarming those who don't really need to reindex. Reindexing is a huge pain for some installations. > Document issues involved in building your index with one jdk version and then > searching/updating with another > - > > Key: LUCENE-2073 > URL: https://issues.apache.org/jira/browse/LUCENE-2073 > Project: Lucene - Java > Issue Type: Task >Reporter: Mark Miller > Attachments: LUCENE-2073.patch, LUCENE-2073.patch > > > I think this needs to go in something of a permenant spot - this isn't a one > time release type issues - its going to present over multiple release. > {quote} > If there is nothing we can do here, then we just have to do the best we can - > such as a prominent notice alerting that if you transition JVM's between > building and searching the index and you are using or doing X, things will > break. > We should put this in a spot that is always pretty visible - perhaps even a > new readme file titlted something like IndexBackwardCompatibility or > something, to which we can add other tips and gotchyas as they come up. Or > MaintainingIndicesAcrossVersions, or > FancyWhateverGetsYourAttentionAboutUpgradingStuff. Or a permanent > entry/sticky entry at the top of Changes. > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
[jira] Commented: (LUCENE-2073) Document issues involved in building your index with one jdk version and then searching/updating with another
[ https://issues.apache.org/jira/browse/LUCENE-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778998#action_12778998 ] Marvin Humphrey commented on LUCENE-2073: - I like this: > some parts of Lucene ... but I still think the message is a little too aggressive. There are a lot of people just using ye olde StandardAnalyzer, and they don't need to reindex. We don't need to spread our own FUD. :) Can we change it to say "Analyzers", and then refer people to the docs for their specific Analyzer? Alternatively, should that notification just contain a complete list of the affected classes? > Document issues involved in building your index with one jdk version and then > searching/updating with another > - > > Key: LUCENE-2073 > URL: https://issues.apache.org/jira/browse/LUCENE-2073 > Project: Lucene - Java > Issue Type: Task >Reporter: Mark Miller > Attachments: LUCENE-2073.patch, LUCENE-2073.patch > > > I think this needs to go in something of a permenant spot - this isn't a one > time release type issues - its going to present over multiple release. > {quote} > If there is nothing we can do here, then we just have to do the best we can - > such as a prominent notice alerting that if you transition JVM's between > building and searching the index and you are using or doing X, things will > break. > We should put this in a spot that is always pretty visible - perhaps even a > new readme file titlted something like IndexBackwardCompatibility or > something, to which we can add other tips and gotchyas as they come up. Or > MaintainingIndicesAcrossVersions, or > FancyWhateverGetsYourAttentionAboutUpgradingStuff. Or a permanent > entry/sticky entry at the top of Changes. > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2073) Document issues involved in building your index with one jdk version and then searching/updating with another
[ https://issues.apache.org/jira/browse/LUCENE-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778875#action_12778875 ] Marvin Humphrey commented on LUCENE-2073: - Which components are affected by this? I think just Analyzers and query parsers, yes? If that's true, my inclination would be to add a note to the javadocs for each such class. In every case, it's theoretically possible to build alternative implementations which are unaffected by upgrading the JVM. This isn't a fundamental problem with the Lucene architecture; it's an artifact of the way certain classes are implemented. Outside of the affected components, Lucene doesn't get down and dirty with Unicode properties and other fast-moving stuff -- it's just dealing in UTF-8 bytes, Java strings, etc. Those things can change (Modified UTF-8, shudder), but they move on a slower timescale. Arguably, Analyzer subclasses shouldn't be in core for reasons like this. Perhaps there could be an "ICUAnalysis" package which depends on ICU4J, so that Unicode-related index incompatibilities occur when you upgrade your Unicode library. Though most people would probably choose to use the smaller-footprint, zero-dependency "JVMAnalysis" package, where reindexing would be required after a JVM upgrade. The software certifiers wouldn't like that, and I'm not seriously advocating such a disruptive change (yet), but I just wanted to illustrate that this is a contained problem. > Document issues involved in building your index with one jdk version and then > searching/updating with another > - > > Key: LUCENE-2073 > URL: https://issues.apache.org/jira/browse/LUCENE-2073 > Project: Lucene - Java > Issue Type: Task >Reporter: Mark Miller > > I think this needs to go in something of a permenant spot - this isn't a one > time release type issues - its going to present over multiple release. 
> {quote} > If there is nothing we can do here, then we just have to do the best we can - > such as a prominent notice alerting that if you transition JVM's between > building and searching the index and you are using or doing X, things will > break. > We should put this in a spot that is always pretty visible - perhaps even a > new readme file titlted something like IndexBackwardCompatibility or > something, to which we can add other tips and gotchyas as they come up. Or > MaintainingIndicesAcrossVersions, or > FancyWhateverGetsYourAttentionAboutUpgradingStuff. Or a permanent > entry/sticky entry at the top of Changes. > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771201#action_12771201 ] Marvin Humphrey commented on LUCENE-1997: - > What kind of comparator can't pre-create a fixed ordinal list for all the > possible values? I'm sure I've seen this too, but I can't bring one to mind > right now. I think the only time the ordinal list can't be created is when the source array contains some value that can't be compared against another value -- e.g. some variant on NULL -- or when the comparison function is broken, e.g. when a < b and b < c but c > a. For current KinoSearch and future Lucy, we pre-build the ord array at index time and mmap it at search time. (Thanks to mmap, sort caches have virtually no impact on IndexReader launch time.) > Explore performance of multi-PQ vs single-PQ sorting API > > > Key: LUCENE-1997 > URL: https://issues.apache.org/jira/browse/LUCENE-1997 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.9 >Reporter: Michael McCandless >Assignee: Michael McCandless > Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, > LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, > LUCENE-1997.patch > > > Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev, > where a simpler (non-segment-based) comparator API is proposed that > gathers results into multiple PQs (one per segment) and then merges > them in the end. > I started from John's multi-PQ code and worked it into > contrib/benchmark so that we could run perf tests. Then I generified > the Python script I use for running search benchmarks (in > contrib/benchmark/sortBench.py). > The script first creates indexes with 1M docs (based on > SortableSingleDocSource, and based on wikipedia, if available). 
Then > it runs various combinations: > * Index with 20 balanced segments vs index with the "normal" log > segment size > * Queries with different numbers of hits (only for wikipedia index) > * Different top N > * Different sorts (by title, for wikipedia, and by random string, > random int, and country for the random index) > For each test, 7 search rounds are run and the best QPS is kept. The > script runs singlePQ then multiPQ, and records the resulting best QPS > for each and produces table (in Jira format) as output. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
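To make the ord-array idea concrete, here is a rough sketch of how such an array could be derived at index time -- not actual KinoSearch/Lucy code, and the names are invented: sort the doc ids by field value, then assign ordinals, bumping the ordinal only when the value changes.

```java
import java.util.Arrays;
import java.util.Comparator;

class OrdBuilder {
    // values[doc] is the sort key for doc. Returns ords[doc] such that
    // comparing two ords is equivalent to comparing the underlying values,
    // so sorting hits needs only cheap int comparisons.
    static int[] buildOrds(String[] values) {
        Integer[] docs = new Integer[values.length];
        for (int i = 0; i < docs.length; i++) docs[i] = i;
        Arrays.sort(docs, Comparator.comparing((Integer d) -> values[d]));
        int[] ords = new int[values.length];
        int ord = 0;
        for (int i = 0; i < docs.length; i++) {
            // New ordinal only when the value differs from its predecessor.
            if (i > 0 && !values[docs[i]].equals(values[docs[i - 1]])) ord++;
            ords[docs[i]] = ord;
        }
        return ords;
    }
}
```

A null in the source array would blow up the comparator here, which is exactly the "value that can't be compared" case described above.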
Re: Lucene 2.9 and deprecated IR.open() methods
On Mon, Oct 05, 2009 at 08:27:20AM +0200, Uwe Schindler wrote: > Pass a Properties or Map to the ctor/open. The keys are predefined > constants. Maybe our previous idea of an IndexConfiguration class is a > subclass of HashMap with all the constants and some easy-to-use > setter methods for very often-used settings (like index dir) and some > reasonable defaults. Interesting. The design we worked out for Lucy's Segment class (prototype in KS devel branch) uses hash/array/string data to store arbitrary metadata on behalf of segment components, written as JSON to seg_NNN/segmeta.json. In that case, though, each component is responsible for generating and consuming its own data. That's different from having the user supply data via such a format. I still think you're going to want an extensible builder class. > This allows us to pass these properties to any flex indexing component > without need to modify/extend it to support the additional properties. The > flexible indexing component just defines its own property names (e.g. as > URNs, URLs, using its class name as prefix,...). But how do you determine what the flex indexing components *are*? In theory, you can pass class names and sufficient arguments to build them up via your big ball of data, but then you're essentially creating a new language, with all the headaches that entails. In KS, Indexer/IndexReader configuration is divided between three classes. * Schema: field definitions. * Architecture: Settings that never change for the life of the index. * IndexManager: Settings that can change per index/search session. Schema isn't worth discussing -- Lucy will have it, Lucene won't, end of story. Architecture and IndexManager, though, are fairly close to what's being discussed. Architecture is responsible for e.g. determining which pluggable components get registered. It's the builder class. IndexManager is where things like merging and locking policies reside. 
> Property names are always String, values any type (therefore Map). > With Java 5, integer props and so on are no "bad syntax" problem because of > autoboxing (no need to pass new Integer() or Integer.valueOf()). Argument validation gets to be a headache when you pass around complex data structures. It's doable, but messy and hard to grok. Going through dedicated methods is cleaner and safer. > Another good thing is, that implementors of e.g. XML config files like in > Solr, can simple pass all elements in config to this map. I go back and forth on this. At some point, the volume of data becomes overwhelming and it becomes easier to swap in the name of a class where most of the data can reside in nice, reliable, structured code. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
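To illustrate the validation point: with a bag of properties, a bad value sails through and fails far from where it was introduced, while a dedicated setter fails fast at the call site. A toy sketch -- the class and setting names are invented, not proposed Lucene API:

```java
import java.util.HashMap;
import java.util.Map;

class IndexConfig {
    private int mergeFactor = 10;

    // Dedicated setter: a bad value fails immediately, at the call site.
    IndexConfig setMergeFactor(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        this.mergeFactor = mergeFactor;
        return this;
    }

    int getMergeFactor() { return mergeFactor; }

    // Map-based config: the same mistake slips in silently here and only
    // surfaces (or quietly misbehaves) wherever the value is finally read.
    static Map<String, Object> asProperties(int mergeFactor) {
        Map<String, Object> props = new HashMap<>();
        props.put("mergeFactor", mergeFactor);
        return props;
    }
}
```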
Re: Lucene 2.9 and deprecated IR.open() methods
On Sun, Oct 04, 2009 at 05:53:14AM -0400, Michael McCandless wrote: > 1 Do we prevent config settings from changing after creating an > IW/IR? Any settings conveyed via a settings object ought to be final if you want pluggable index components. Otherwise, you need some nightmarish notification system to propagate settings down into your subcomponents, which may or may not be prepared to handle the value modifications. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
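A settings object can be made final simply by freezing everything at construction, so subcomponents can capture it without any notification machinery. A hedged sketch with invented names:

```java
// Immutable settings object: once the writer is built, nothing can change
// underneath the pluggable subcomponents that hold a reference to it.
final class WriterSettings {
    private final int maxBufferedDocs;
    private final double ramBufferSizeMB;

    WriterSettings(int maxBufferedDocs, double ramBufferSizeMB) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.ramBufferSizeMB = ramBufferSizeMB;
    }

    int maxBufferedDocs() { return maxBufferedDocs; }
    double ramBufferSizeMB() { return ramBufferSizeMB; }
}
```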
Re: Lucene 2.9 and deprecated IR.open() methods
On Sun, Oct 04, 2009 at 03:04:13PM -0400, Mark Miller wrote: > Earwin Burrfoot wrote: > > As I stated in my last email, there's zero difference between > > settings+static factory and builder except for syntax. Cannot > > understand what Mark, Mike are arguing about. > > > Sounds like we are arguing that we don't like the syntax then... So, implement the static factory methods as wrappers around the builder method.

public static IndexWriter open(Directory dir, Analyzer analyzer) {
    return open(new IndexManager(dir), dir, analyzer);
}

public static IndexWriter open(IndexManager manager, Directory dir, Analyzer analyzer) {
    return open(new Architecture(), manager, dir, analyzer);
}

public static IndexWriter open(Architecture arch, IndexManager manager, Directory dir, Analyzer analyzer) {
    return arch.buildIndexWriter(manager, dir, analyzer);
}

IMO, it's important not to force first-time users to grok builder classes in order to perform basic indexing or searching. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: custom segment files
On Fri, Sep 18, 2009 at 08:14:24AM +0800, John Wang wrote: > Say you have a type of field with fixed length data per doc, e.g. a 8 bytes. > It might be good to store in a segment: > Heh. You've just described this proof of concept class: http://www.rectangular.com/kinosearch/docs/devel/KSx/Index/ByteBufDocWriter.html http://www.rectangular.com/svn/kinosearch/trunk/perl/lib/KSx/Index/ByteBufDocWriter.pm > Hopefully I am describing it clearly. Sure, I understand exactly what you mean. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
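The fixed-length layout John describes boils down to one multiplication per lookup: the value for doc N lives at offset N * width. A standalone in-memory sketch with invented names -- not the actual ByteBufDocWriter implementation, which writes to a segment file:

```java
import java.nio.ByteBuffer;

class FixedWidthDocValues {
    // Fixed-length per-doc records: no per-doc pointers or length prefixes,
    // just a single offset calculation per lookup.
    private final ByteBuffer data;
    private final int width;

    FixedWidthDocValues(int numDocs, int width) {
        this.data = ByteBuffer.allocate(numDocs * width);
        this.width = width;
    }

    void put(int doc, byte[] value) {
        if (value.length != width) {
            throw new IllegalArgumentException("value must be exactly " + width + " bytes");
        }
        data.position(doc * width);
        data.put(value);
    }

    byte[] get(int doc) {
        byte[] out = new byte[width];
        data.position(doc * width);
        data.get(out);
        return out;
    }
}
```

On disk, the same arithmetic works against an mmap'd file, which is what makes IndexReader startup cheap.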
[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect
[ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755062#action_12755062 ] Marvin Humphrey commented on LUCENE-1908: - The rationale behind the coarseness of the norms is that since the accuracy of search engines in retrieving the documents that the user really wants is so poor, only big differences matter. (It's not just poor "recall" against a given query, but the difficulty that the user experiences in formulating a proper query to express what they're really looking for in the first place.) Doug wrote at least once about this some years back, but I haven't been able to track down the post. > Similarity javadocs for scoring function to relate more tightly to scoring > models in effect > --- > > Key: LUCENE-1908 > URL: https://issues.apache.org/jira/browse/LUCENE-1908 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Doron Cohen >Assignee: Doron Cohen >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1908.patch, LUCENE-1908.patch, LUCENE-1908.patch, > LUCENE-1908.patch > > > See discussion in the related issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1900) Confusing Javadoc in Searchable.java
[ https://issues.apache.org/jira/browse/LUCENE-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752675#action_12752675 ] Marvin Humphrey commented on LUCENE-1900: - IMO, maxDoc(), docFreq(), and docFreqs() are all expert, because they all require an understanding of the deletions mechanism to grasp their behavior. I'd vote for adding the "expert" tag to IndexReader.maxDoc() before stripping it from those. > Confusing Javadoc in Searchable.java > > > Key: LUCENE-1900 > URL: https://issues.apache.org/jira/browse/LUCENE-1900 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.9 >Reporter: Nadav Har'El >Assignee: Mark Miller >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1900.patch > > > In Searchable.java, the javadoc for maxdoc() is: > /** Expert: Returns one greater than the largest possible document number. >* Called by search code to compute term weights. >* @see org.apache.lucene.index.IndexReader#maxDoc() > The qualification "expert" and the statement "called by search code to > compute term weights" is a bit confusing, It implies that maxdoc() somehow > computes weights, which is obviously not true (what it does is explained in > the other sentence). Maybe it is used as one factor of the weight, but do we > really need to mention this here? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1900) Confusing Javadoc in Searchable.java
[ https://issues.apache.org/jira/browse/LUCENE-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752612#action_12752612 ] Marvin Humphrey commented on LUCENE-1900: - maxDoc() isn't just used for calculating weights. It's also used for e.g. figuring out how big your bit vector needs to be in order to accommodate the largest doc in the collection. My vote would be to just strip that extra comment about calculating term weights. > Confusing Javadoc in Searchable.java > > > Key: LUCENE-1900 > URL: https://issues.apache.org/jira/browse/LUCENE-1900 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.9 >Reporter: Nadav Har'El >Priority: Trivial > > In Searchable.java, the javadoc for maxdoc() is: > /** Expert: Returns one greater than the largest possible document number. >* Called by search code to compute term weights. >* @see org.apache.lucene.index.IndexReader#maxDoc() > The qualification "expert" and the statement "called by search code to > compute term weights" is a bit confusing, It implies that maxdoc() somehow > computes weights, which is obviously not true (what it does is explained in > the other sentence). Maybe it is used as one factor of the weight, but do we > really need to mention this here? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm
[ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752591#action_12752591 ] Marvin Humphrey commented on LUCENE-1896: - > at what I am trusting is essentially no cost. Here's the snippet from TermQuery.score() where queryNorm() actually gets applied to each document's score:

{code}
float raw =                                // compute tf(f)*weight
    f < SCORE_CACHE_SIZE                   // check cache
    ? scoreCache[f]                        // cache hit
    : getSimilarity().tf(f)*weightValue;   // cache miss
{code}

At this point, queryNorm() has already been factored into weightValue (and scoreCache). It happens during setup. You can either scale weightValue by queryNorm() during setup or not -- the per-document computational cost is unaffected. > Modify confusing javadoc for queryNorm > -- > > Key: LUCENE-1896 > URL: https://issues.apache.org/jira/browse/LUCENE-1896 > Project: Lucene - Java > Issue Type: Improvement > Components: Javadocs >Reporter: Jiri Kuhn >Priority: Minor > Fix For: 2.9 > > > See http://markmail.org/message/arai6silfiktwcer > The javadoc confuses me as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
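The setup being referred to looks roughly like the following standalone sketch -- simplified, not the actual TermScorer source, though the sqrt-based tf() mirrors DefaultSimilarity. All queryNorm-dependent work is paid once, before any document is scored:

```java
class ScoreCacheSketch {
    static final int SCORE_CACHE_SIZE = 32;

    // Stand-in for Similarity.tf(): sqrt(freq), as in DefaultSimilarity.
    static float tf(int freq) { return (float) Math.sqrt(freq); }

    final float weightValue;  // queryNorm() is already baked in here
    final float[] scoreCache = new float[SCORE_CACHE_SIZE];

    ScoreCacheSketch(float weightValue) {
        this.weightValue = weightValue;
        // Setup cost, paid once per query, not per document.
        for (int f = 0; f < SCORE_CACHE_SIZE; f++) {
            scoreCache[f] = tf(f) * weightValue;
        }
    }

    // The per-document hot path: one compare, one lookup or one multiply.
    float raw(int f) {
        return f < SCORE_CACHE_SIZE ? scoreCache[f] : tf(f) * weightValue;
    }
}
```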
[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm
[ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752311#action_12752311 ] Marvin Humphrey commented on LUCENE-1896: - FWIW, after all that [fuss|http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200802.mbox/%3c9396e8e7-46ff-4b78-9427-13e9a7e58...@rectangular.com%3e], I wound up leaving it in. From the standpoint of ordinary users, queryNorm() is harmless or mildly beneficial. Scores are never going to be comparable across multiple queries without what _I_ normally think of as "normalization" (given my background in audio): setting the top score to 1.0, and multiplying all other scores by the same factor. Nevertheless, it's better for them to be closer together than farther apart. From the standpoint of users trying to write Query subclasses, it's a wash. On the one hand, it's not the most important method, since it doesn't affect ranking within a single query -- and zapping it would mean one less thing to think about. On the other hand, it's nice to have it in there for the sake of completeness in the implementation of cosine similarity. I eventually wound up messing with *how* the query norm gets applied to achieve my de-voodoo-fication goals. Essentially, I hid away queryNorm() so that you don't need to think about it unless you really need it. > Modify confusing javadoc for queryNorm > -- > > Key: LUCENE-1896 > URL: https://issues.apache.org/jira/browse/LUCENE-1896 > Project: Lucene - Java > Issue Type: Improvement > Components: Javadocs >Reporter: Jiri Kuhn >Priority: Minor > Fix For: 2.9 > > > See http://markmail.org/message/arai6silfiktwcer > The javadoc confuses me as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
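The claim that queryNorm() doesn't affect ranking within a single query is easy to check: it multiplies every hit's score by the same positive constant, so relative order is preserved. A small demonstration with invented helper names:

```java
import java.util.Arrays;

class QueryNormDemo {
    // Return doc indices sorted by descending score.
    static Integer[] ranking(float[] scores) {
        Integer[] docs = new Integer[scores.length];
        for (int i = 0; i < docs.length; i++) docs[i] = i;
        Arrays.sort(docs, (a, b) -> Float.compare(scores[b], scores[a]));
        return docs;
    }

    // Apply a query norm: every score is scaled by the same factor.
    static float[] scale(float[] scores, float queryNorm) {
        float[] out = new float[scores.length];
        for (int i = 0; i < out.length; i++) out[i] = scores[i] * queryNorm;
        return out;
    }
}
```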
[jira] Commented: (LUCENE-1877) Improve IndexWriter javadoc on locking
[ https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749363#action_12749363 ] Marvin Humphrey commented on LUCENE-1877: - > I can see how this is not ideal, but I'm not seeing how any of the > mentioned issues apply to our simple lock usage ... "Simple lock usage"?! You must have a bigger brain than me... As a matter of fact, I think you're right. Fcntl locks have two major drawbacks, and upon review I think NativeFSLockFactory avoids both of them. The first is that opening and closing a file releases all locks for the entire process. Even if you never request a lock on the second filehandle, or if you request a lock and the request is denied, closing the filehandle releases the lock on the first filehandle. But NativeFSLockFactory avoids that problem by keeping the HashSet of filepaths and ensuring that the same file is never opened twice. Furthermore, since the lockfiles are private to Lucene, you can assume that nobody else is going to open them and inadvertently spoil the lock. The second is that child processes spawned via fork() do not inherit locks from the parent process. If you assume that nobody's ever going to fork a Java process, that's not relevant. (Too bad that won't work for Lucy... we have to support fork().) I think you're probably safe with Fcntl locks on all non-shared volumes. > Improve IndexWriter javadoc on locking > -- > > Key: LUCENE-1877 > URL: https://issues.apache.org/jira/browse/LUCENE-1877 > Project: Lucene - Java > Issue Type: Improvement > Components: Javadocs >Reporter: Mark Miller >Priority: Trivial > Fix For: 2.9 > > > A user requested we add a note in IndexWriter alerting the availability of > NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm > exit). Seems reasonable to me - we want users to be able to easily stumble > upon this class. 
The below code looks like a good spot to add a note - could > also improve what's there a bit - opening an IndexWriter does not necessarily > create a lock file - that would depend on the LockFactory used. > {code} Opening an IndexWriter creates a lock file for the > directory in use. Trying to open > another IndexWriter on the same directory will lead to a > {@link LockObtainFailedException}. The {@link > LockObtainFailedException} > is also thrown if an IndexReader on the same directory is used to delete > documents > from the index.{code} > Anyone remember why NativeFSLockFactory is not the default over > SimpleFSLockFactory? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1877) Improve IndexWriter javadoc on locking
[ https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749330#action_12749330 ] Marvin Humphrey commented on LUCENE-1877: - > Anyone remember why NativeFSLockFactory is not the default over > SimpleFSLockFactory? Wasn't it because native locking is sometimes implemented with Fcntl, and Fcntl locking blows chunks? Especially for a library rather than an application. From the BSD manpage on Fcntl: {quote} This interface follows the completely stupid semantics of System V and IEEE Std 1003.1-1988 (``POSIX.1'') that require that all locks associated with a file for a given process are removed when any file descriptor for that file is closed by that process. This semantic means that applications must be aware of any files that a subroutine library may access. For example if an application for updating the password file locks the password file database while making the update, and then calls getpwname(3) to retrieve a record, the lock will be lost because getpwname(3) opens, reads, and closes the password database. The database close will release all locks that the process has associated with the database, even if the library routine never requested a lock on the database. Another minor semantic problem with this interface is that locks are not inherited by a child process created using the fork(2) function. The flock(2) interface has much more rational last close semantics and allows locks to be inherited by child processes. Flock(2) is recommended for applications that want to ensure the integrity of their locks when using library routines or wish to pass locks to their children... {quote} The lockfile may be annoying, but at least it's guaranteed safe on all non-shared volumes when the OS implements atomic file opening. Are you folks at least able to clean up an orphaned lockfile if the PID it was created under is no longer active?
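The orphaned-lockfile cleanup asked about above could be sketched like this in modern Java (hypothetical class and method names; the Java 9+ ProcessHandle API is real). Note the usual caveat: PID reuse makes liveness checks heuristic rather than airtight.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical sketch: if a lockfile records the PID that created it,
// and no live process has that PID, the lock can be reclaimed.
public class StaleLockReaper {
    // True when no live process has the given PID.
    public static boolean isStale(long pid) {
        Optional<ProcessHandle> h = ProcessHandle.of(pid);
        return h.isEmpty() || !h.get().isAlive();
    }

    // Assumes the lockfile's contents are just the creator's PID.
    public static void reapIfStale(Path lockFile) throws IOException {
        long pid = Long.parseLong(Files.readString(lockFile).trim());
        if (isStale(pid)) {
            Files.delete(lockFile);
        }
    }

    public static void main(String[] args) {
        // Our own PID is alive, so it is never stale.
        System.out.println(isStale(ProcessHandle.current().pid()));
    }
}
```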
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748109#action_12748109 ] Marvin Humphrey commented on LUCENE-1859: - > I don't believe there is ever any valid argument against adding > documentation. The more that documentation grows, the harder it is to absorb. The more bells and whistles on an API, the harder it is to grok and to use effectively. The more a code base bloats, the harder it is to maintain or to evolve. > keeping average memory usage down prevents those wonderful OutOfMemory > Exceptions No, it won't. If someone is emitting large tokens regularly, it is likely that several threads will require large RAM footprints simultaneously, and an OOM will occur. That would be the common case. If someone is emitting large tokens periodically, well, this doesn't prevent the OOM, it just makes it less likely. That's not worthless, but it's not something anybody should count on when assessing required RAM usage. Keeping average memory usage down is good for the system at large. If this is implemented, that should be the justification.
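The shrink-on-next-token behavior proposed in this issue could be sketched as follows (hypothetical names, not the actual TermAttributeImpl code): the buffer grows to fit any token, but once it exceeds MAX_BUFFER_SIZE it is shrunk back the next time a token that fits arrives -- which is exactly why it lowers average but not peak usage.

```java
// Hypothetical sketch of the MAX_BUFFER_SIZE idea from LUCENE-1859.
public class ShrinkableTermBuffer {
    static final int MAX_BUFFER_SIZE = 16 * 1024;
    private char[] buffer = new char[64];

    public void setTermBuffer(char[] src, int off, int len) {
        if (len > buffer.length) {
            // Grow to fit a large token: peak usage is unaffected.
            buffer = new char[len];
        } else if (buffer.length > MAX_BUFFER_SIZE && len <= MAX_BUFFER_SIZE) {
            // Reclaim an oversized buffer once a small token arrives.
            buffer = new char[MAX_BUFFER_SIZE];
        }
        System.arraycopy(src, off, buffer, 0, len);
    }

    public int capacity() { return buffer.length; }

    public static void main(String[] args) {
        ShrinkableTermBuffer b = new ShrinkableTermBuffer();
        b.setTermBuffer(new char[100_000], 0, 100_000);
        System.out.println(b.capacity()); // grew to hold the huge token
        b.setTermBuffer(new char[]{'a'}, 0, 1);
        System.out.println(b.capacity()); // shrank back to MAX_BUFFER_SIZE
    }
}
```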
> TermAttributeImpl's buffer will never "shrink" if it grows too big > -- > > Key: LUCENE-1859 > URL: https://issues.apache.org/jira/browse/LUCENE-1859 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.9 >Reporter: Tim Smith >Priority: Minor > > This was also an issue with Token previously as well > If a TermAttributeImpl is populated with a very long buffer, it will never be > able to reclaim this memory > Obviously, it can be argued that Tokenizer's should never emit "large" > tokens, however it seems that the TermAttributeImpl should have a reasonable > static "MAX_BUFFER_SIZE" such that if the term buffer grows bigger than this, > it will shrink back down to this size once the next token smaller than > MAX_BUFFER_SIZE is set > I don't think i have actually encountered issues with this yet, however it > seems like if you have multiple indexing threads, you could end up with a > char[Integer.MAX_VALUE] per thread (in the very worst case scenario) > perhaps growTermBuffer should have the logic to shrink if the buffer is > currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748102#action_12748102 ] Marvin Humphrey commented on LUCENE-1859: - > i fail to see the complexity of adding one method to TermAttribute: Death by a thousand cuts. This is one cut. I wouldn't even add the note to the documentation. If you emit large tokens, you have to plan for obscene peak memory usage anyway, and if you're not prepared for that, you deserve what you get. Keeping the average down doesn't help that. The only reason to do this is to keep average memory usage down for the hell of it, and if it goes in, it should be an implementation detail.
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748089#action_12748089 ] Marvin Humphrey commented on LUCENE-1859: - IMO, the benefit of adding these theoretical helper methods to lower average -- but not peak -- memory usage by non-core Tokenizers which are probably doing something impractical anyway... does not justify the complexity cost.
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748064#action_12748064 ] Marvin Humphrey commented on LUCENE-1859: - The worst-case scenario seems kind of theoretical, since there are so many reasons that huge tokens are impractical. (Is a priority of "major" justified?) If there's a significant benefit to shrinking the allocation, it's minimizing average memory usage over time. But even that assumes a nearly pathological distribution in field size -- it would have to be large for early documents, then consistently small for subsequent documents. If it's scattered, you have to plan for worst case RAM usage as an app developer, anyway. Which generally means limiting token size. I assume that, based on this report, TermAttributeImpl never gets reset or discarded/recreated over the course of an indexing session? -0 if the reallocation happens no more often than once per document. -1 if the reallocation has to be performed in an inner loop.
Re: Finishing Lucene 2.9
On Mon, Aug 24, 2009 at 10:15:20PM +0300, Shai Erera wrote: > I think it all boils down to this jar drop-in ability. Expecting jar drop-in compatibility for bugfix releases is 100% reasonable. Expecting something close to jar drop-in compatibility for minor releases seems pretty reasonable, too. Expecting jar drop-in compatibility minus deprecations at a major version change is only reasonable when that has been made the explicit public policy of the project. By making that promise, you are squandering your one opportunity to make disruptive changes. Instead, you're trying to shoehorn what ought to be disruptive changes into an artificially continuous release cycle. It's a lot of work, results in a lot of inelegant compatibility APIs, and seems not to have been successfully implemented yet for 2.9. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Finishing Lucene 2.9
On Mon, Aug 24, 2009 at 01:46:35PM -0400, Michael McCandless wrote: > Right, that is and has been the "plan" for 2.9/3.0/3.1 for quite some time. > > We are now discussing whether to change the plan, but so far it looks > likely we'll just stick with it... It seems like breaking the promise would be disruptive now. But you have an opportunity to change the policy at 3.0, affecting 3.9 and 4.0. That's a 3.0 issue, though -- not a 2.9 issue. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Finishing Lucene 2.9
On Mon, Aug 24, 2009 at 11:44:17AM -0400, Michael McCandless wrote: > Separately, we can think about having 3.1 be a "real" release, not > just a "fast turnaround" release. All problems flow from this "fast turnaround release" constraint. If you had the freedom to make the kind of API changes people normally expect to accompany a major version change, everything would be a lot simpler. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1684) Add matchVersion to StandardAnalyzer
[ https://issues.apache.org/jira/browse/LUCENE-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718485#action_12718485 ] Marvin Humphrey commented on LUCENE-1684: - +1 This approach addresses all of my concerns about action-at-a-distance behaviors. Nice work, Mike. > Add matchVersion to StandardAnalyzer > > > Key: LUCENE-1684 > URL: https://issues.apache.org/jira/browse/LUCENE-1684 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1684.patch > > > I think we should add a matchVersion arg to StandardAnalyzer. This > allows us to fix bugs (for new users) while keeping precise back > compat (for users who upgrade). > We've discussed this on java-dev, but I'd like to now make it concrete > (patch attached). I think it actually works very well, and is a > simple tool to help us carry out our back-compat policy. > I coded up an example with StandardAnalyzer: > * The ctor now takes a required arg (Version matchVersion). You > pass Version.LUCENE_CURRENT to always get latest & greatest, or eg > Version.LUCENE_24 to match 2.4's bugs/settings/behavior. > * StandardAnalyzer conditionalizes the "replace invalid acronym" and > "enable position increment in StopFilter" based on matchVersion. > * It also prevents creating zillions of ctors, over time, as we need > to change settings in the class. EG StandardAnalyzer now has 2 > settings that are version dependent, and there's at least another > 2 issues open on fixing some more of its bugs. > The migration is also very clean: we'd only add this to classes on an > "as needed" basis. On the first release that adds the arg, the > default remains back compatible with the prior release. Then, going > forward, we are free to fix issues on that class and conditionalize by > matchVersion.
> The javadoc at the top of StandardAnalyzer clearly calls out what > version specific behavior is done: > {code} > * You must specify the required {@link Version} > * compatibility when creating StandardAnalyzer: > * > *As of 2.9, StopFilter preserves position > *increments by default > *As of 2.9, Tokens incorrectly identified as acronyms > *are corrected (see LUCENE-1608: https://issues.apache.org/jira/browse/LUCENE-1068) > * > * > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
On Fri, May 22, 2009 at 10:40:03PM +0400, Earwin Burrfoot wrote: > >> Custom analyzers. > > No problem. > How are they recorded in the index? Analyzers must implement dump() and load(), which convert the Analyzer to/from a JSON-izable data structure. They end up as JSON in index_dir/schema_NNN.json. Custom subclasses must be loaded by whatever app wants to read the index, naturally. > >> Intentionally different analyzers for indexing and searching. > > No problem. That only makes sense in the context of QueryParser, and the KS > > QueryParser allows you to supply an analyzer which overrides the Schema. > But well, it differs from analyzer used for indexation in one or two > options, and shares a heap of others. A constructor argument solves that problem, doesn't it? Am I missing something? > >> Using this analyzer without any index at all - like I do highlight on > >> a separate machine to minimize GC pauses, or tag docs by running a > >> heap of queries against MemoryIndex. > > No problem. Distribute a Schema subclass among several machines. > You mean read an index on one machine, create Analyzer, serialize it > and send over the wire to other machines? I hope that's either a joke > or I misunderstood you. Please. How did your Analyzer class get on the other machines? Do the same thing with your Schema subclass. > Storing a list of stopwords in the index sounds fun. Storing a fat > synonym/morphology dictionary while completely analogous, is no longer > fun. So, don't store that whole dictionary in the serialized Analyzer -- just store a version number. Make the synonym data class data. If it's reasonable to key multiple versions of the class data off of the version number constructor argument, do that. If not and an index was built with a version of the Analyzer that is no longer supported, either throw an exception or intentionally ignore the mismatch and serve screwed up search results. Your call.
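The dump()/load() scheme described above can be sketched in Java (KinoSearch itself is Perl, and every name here is illustrative, not its real API): the serialized form records only the analyzer class and a data version, not the whole synonym dictionary, and load() refuses versions it no longer supports instead of silently serving bad results.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an analyzer that serializes a version number
// instead of its full synonym dictionary.
public class SynonymAnalyzerStub {
    static final int SUPPORTED_VERSION = 2;
    final int dataVersion;

    SynonymAnalyzerStub(int dataVersion) { this.dataVersion = dataVersion; }

    // dump(): a JSON-izable structure, small enough to live in the index.
    Map<String, Object> dump() {
        Map<String, Object> m = new HashMap<>();
        m.put("class", "SynonymAnalyzerStub");
        m.put("data_version", dataVersion); // version number, not the dictionary
        return m;
    }

    // load(): fail catastrophically, not subtly, on an unsupported version.
    static SynonymAnalyzerStub load(Map<String, Object> m) {
        int v = (Integer) m.get("data_version");
        if (v != SUPPORTED_VERSION) {
            throw new IllegalStateException("unsupported synonym data version: " + v);
        }
        return new SynonymAnalyzerStub(v);
    }

    public static void main(String[] args) {
        SynonymAnalyzerStub a = load(new SynonymAnalyzerStub(2).dump());
        System.out.println(a.dataVersion);
    }
}
```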
Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
On Fri, May 22, 2009 at 09:06:32PM +0400, Earwin Burrfoot wrote: > > In KinoSearch SVN trunk, satellite classes like QueryParser and Highlighter > > have to be passed a Schema, which contains all the Analyzers. Analyzers > > aren't satellite classes under this model -- they are a fixed property of a > > FullTextType field spec. Think of them as baked into an SQL field > > definition. > > > > You can create a Schema from scratch to pass to the QueryParser, but it's > > easier to just get it from the Searcher. Translating to Java... > > > > Searcher searcher = new Searcher("/path/to/index"); > > QueryParser qparser = new QueryParser(searcher.getSchema()); > > > > I don't see how that's so different from getting an analyzer actsAsVersion > > number from the index. > > > > Now, where stuff might start to get complicated is > > PerFieldAnalyzerWrapper... > > is that where the sneakiness gets overwhelming? > Some people can have setups more complex than that. > Different analyzers per field. Heh. One of the primary rationales behind Schema was to tie individual analyzers to specific fields. > Custom analyzers. No problem. > Several indexes using the same analyzer. No problem. Only necessary if the analyzer is costly or has some esoteric need for shared state. And possible via subclassing Schema or Analyzer. > Intentionally different analyzers for indexing and searching. No problem. That only makes sense in the context of QueryParser, and the KS QueryParser allows you to supply an analyzer which overrides the Schema. > Using this analyzer without any index at all - like I do highlight on > a separate machine to minimize GC pauses, or tag docs by running a > heap of queries against MemoryIndex. No problem. Distribute a Schema subclass among several machines. These are all solved problems under the per-index field semantics serialized Schema model. That's why I said it was the "theoretical solution". 
Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
On Fri, May 22, 2009 at 01:22:24PM -0400, Michael McCandless wrote: > > Sounds like an argument for more frequent major releases. > > Yeah. Or "rebranding" what we now call minor as major releases, by > changing our policy ;) Not sure how much of that is a jest, but I don't think that's a good idea. It violates commonly held expectations about what constitutes a "minor release". Of course, I'm not sure to what extent modified interfaces will surprise people. At least that's compile-time... but then it will make it harder for multiple apps with Lucene dependencies to coexist. > Will Lucy do scoring when sorting by field, by default? Nope. Why would we do that? The only reason you're doing it in Lucene is to preserve back compat, and Lucy doesn't have that constraint. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
> I feel the opposite: I'd like new users to see improvements by > default, and users that require strict back-compat to ask for that. By "strict back-compat", do you mean "people who would like their search app to not fail silently"? ;) A "new user" who follows your advice... // haha stupid noob StandardAnalyzer analyzer = new StandardAnalyzer(Versions.LATEST); ... is going to get screwed when the default tokenization behavior changes. And it would be much worse if we follow my preference for making the arg optional without following my preference for keeping defaults intact: // haha eat it luser StandardAnalyzer analyzer = new StandardAnalyzer(); It's either make the arg mandatory when changing default behavior and recommend that new users pass a fixed argument, or make it optional but keep defaults intact between major releases. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
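The two acceptable policies named in the message above can be sketched side by side (hypothetical class, not the real StandardAnalyzer): either the version argument is mandatory, or a no-arg constructor exists but stays pinned to the old defaults rather than silently tracking "latest".

```java
// Hypothetical sketch of the "mandatory arg vs. pinned default" choice.
public class VersionedAnalyzer {
    public enum Version { LUCENE_24, LUCENE_29 }

    final Version actsAsVersion;

    // Policy 1: mandatory arg -- callers must commit to a fixed version.
    public VersionedAnalyzer(Version v) { this.actsAsVersion = v; }

    // Policy 2: optional arg, but the default is pinned to the behavior
    // of the last major release -- never an implicit "LATEST".
    public VersionedAnalyzer() { this(Version.LUCENE_24); }

    public static void main(String[] args) {
        System.out.println(new VersionedAnalyzer().actsAsVersion);
    }
}
```

What both policies rule out is the trap in the quoted snippet: a moving target like `Versions.LATEST` baked into application code.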
Re: Lucene's default settings & back compatibility
On Fri, May 22, 2009 at 11:33:33AM -0400, Michael McCandless wrote: > when working on 3.1 if we make some great improvement, I'd like new users in > 3.1 to see the improvement by default. Sounds like an argument for more frequent major releases. But I'm not exactly one to talk. ;) > On thinking about it more... automagically storing the "actsAsVersion" > in the index, and then having IndexWriter (for example) ask the > analyzer for a tokenStream matching that version, seems a little too > sneaky. Can you elaborate? In KinoSearch SVN trunk, satellite classes like QueryParser and Highlighter have to be passed a Schema, which contains all the Analyzers. Analyzers aren't satellite classes under this model -- they are a fixed property of a FullTextType field spec. Think of them as baked into an SQL field definition. You can create a Schema from scratch to pass to the QueryParser, but it's easier to just get it from the Searcher. Translating to Java... Searcher searcher = new Searcher("/path/to/index"); QueryParser qparser = new QueryParser(searcher.getSchema()); I don't see how that's so different from getting an analyzer actsAsVersion number from the index. Now, where stuff might start to get complicated is PerFieldAnalyzerWrapper... is that where the sneakiness gets overwhelming? > I prefer the up-front "you specify actsAsVersion" when you > create the analyzer, only for analyzers that have changed across > releases. So things like WhitespaceAnalyzer would likely never need > an actsAsVersion arg. Hmm, this is kind of hard. I'd prefer that the argument remain optional, so that new users don't have to think about it. But unlike in KS/Lucy, then there's a danger of leaving it off inadvertently and getting the wrong behavior. :\ Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
On Fri, May 22, 2009 at 11:53:02AM -0400, Michael McCandless wrote: > 1. If we deprecate an API in the 2.1 release, we can remove it in > the next minor release (2.2). > > 2. JAR drop-in-ability is only guaranteed on point releases (2.4.1 > is a drop-in replacement to 2.4.0). When switching to a new > minor release (2.1 -> 2.2) likely you'll need to recompile. > 4. [Maybe?] Allow certain limited changes that will require source > code changes in your app on upgrading to a new minor release: > adding a new method to an interface, adding a new abstract method > to an abstract class, renaming of deprecated methods. These make sense to me. Catastrophic failure at compile time is vastly easier to deal with than subtle failure at run time. > 3. Default settings can change, but if the change is big enough (and > certainly if it will impact what's indexed or how searches find > docs/do scoring), we add a required "actsAsVersion" arg to the > ctor of the affected class. New users get the latest & greatest, > and upgraded users keep their old defaults. I still like per-class settings classes. For instance, an IndexWriterSettings class which allows you to hide away all the tweaky stuff that's cluttering up the IndexWriter API. IndexWriterSettings settings = new IndexWriterSettings("3.1"); IndexWriter writer = new IndexWriter("path/to/index", analyzer, settings); I also think that the argument should be optional rather than mandatory, and that defaults should remain stable between major releases. In other words, to take advantage of improved defaults, you need to ask for them -- but new users don't have to think about such things during the initial learning phase. This approach is reasonably close to how Architecture and IndexManager are used to hide away settings for the KS/Lucy Indexer class. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
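The per-class settings idea above might look like this (the author's `IndexWriterSettings` is hypothetical, and the version-dependent flag here is an invented example): the version string is captured once, and the tweaky defaults are derived from it inside the settings object instead of cluttering the IndexWriter constructor list.

```java
// Hypothetical sketch of a per-class settings object whose defaults
// are conditionalized on the requested version.
public class IndexWriterSettings {
    final String actsAsVersion;
    final boolean usePerSegmentSearch; // invented example of a versioned default

    public IndexWriterSettings(String actsAsVersion) {
        this.actsAsVersion = actsAsVersion;
        // Behavior introduced in "2.9" is enabled only for that version or later.
        this.usePerSegmentSearch = actsAsVersion.compareTo("2.9") >= 0;
    }

    public static void main(String[] args) {
        System.out.println(new IndexWriterSettings("3.1").usePerSegmentSearch);
    }
}
```

(String comparison stands in for real version parsing, which a production sketch would need.)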
Re: Lucene's default settings & back compatibility
On Thu, May 21, 2009 at 05:19:43PM -0400, Michael McCandless wrote: > Marvin, which solution would you prefer? Between the two, I'd prefer settings constructor arguments, though I would be inclined to have settings classes that are specific to individual classes rather than Lucene-wide. At least that scheme gets locality right. The global actsAsVersion variable violates that principle and has the potential to saddle a small number of users who have done absolutely nothing wrong with bugs that are very, very hard to hunt down. That's unfair. As far as analyzers and token streams, the theoretical answer is making indexes self-describing via serializable schemas, as discussed on the Lucy dev list, and as implemented in KinoSearch svn trunk. With versioning metadata attached to the index, there is no longer any worry about upgrading analysis modules provided that those modules handle their own versioning correctly. For instance, in KS the Stopalizer always embeds the complete stoplist in the schema file, so even if we update the "English" stoplist, we don't get invalid search results for indexes which were created with the old stoplist. Similarly, it may not be possible to keep around multiple variants of Snowball, but at least we can fail catastrophically instead of subtly if we detect that the Snowball version has changed. Full-on schema serialization isn't feasible for Lucene, but attaching an actsAsVersion variable to an index and feeding that to your analyzers would be a decent start. Lastly, I think a major java Lucene release is justified already. Won't this discussion die down somewhat if you can get 3.0 out? If there are issues that are half done, how about rolling back whatever's in the way? Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
Mike McCandless: > Well this is what I love about the actsAsVersion solution. There's no > pain for our back-compat users (besides the one-time effort to set > actsAsVersion), and new users always get the best settings. When some mad-as-hell user complains to this list after spending an inordinate amount of time chasing down an action-at-a-distance bug because of this insidious and irresponsible OO design decision, I intend to follow up their email with an I-told-you-so. There's an action-at-a-distance bug in the Perl core module 'base.pm' that bedeviled people for years before I finally cornered it. Turns out it can't be fixed, but at least now we know what's happening: http://rt.cpan.org/Public/Bug/Display.html?id=28799 While this error does not occur frequently in the wild, when it does, the cost to the user is high because the debug path is obscure. I personally encountered it after failing to wrap a "use_ok" test in a BEGIN block; isolating it took me... longer than I would have liked. ;) That bug has led to 'base' having a compromised reputation among elite users because of intermittent, inexplicable flakiness. Is that what you want for Lucene? Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
On Wed, May 20, 2009 at 05:57:49PM +0400, Earwin Burrfoot wrote: > > What happens when two libraries loaded in the same VM have Lucene as a > > dependency and set actsAsVersion to conflicting numbers? > Exactly what happens when you call BooleanQuery.setMaxClauseCount(n) > from two libraries. > Last one wins. Yeesh, that's evil. :( It will be sweet, sweet justice if one of your own projects gets infected by the kind of action-at-a-distance bug you're so blithely unconcerned about. http://en.wikipedia.org/wiki/Action_at_a_distance_(computer_science) That was supposed to be a rhetorical question. To be clear, I consider the idea of a settable global variable determining library behavior completely unacceptable. Changing class load order somewhere in your code shouldn't do things like change search results (because Stopfilters are applied differently depending on who "won"). Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
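The "last one wins" hazard described above is easy to demonstrate in a few lines (hypothetical names): two independent libraries set the same process-wide global, and whichever loads last silently determines behavior for both -- including code that never touched the variable.

```java
// Demonstration of the settable-global anti-pattern: last writer wins.
public class GlobalConfig {
    public static int actsAsVersion = 0; // non-constant global

    static void libraryA() { actsAsVersion = 24; }
    static void libraryB() { actsAsVersion = 29; }

    public static void main(String[] args) {
        libraryA();
        libraryB();                        // last writer wins
        System.out.println(actsAsVersion); // libraryA's setting is silently gone
    }
}
```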
Re: Lucene's default settings & back compatibility
> But since 3.0 is a major release anyway, we could change the default > of actsAsVersion with each 3.x release (or just set it to 3) and > require that a users set actsAsVersion=3 (or whatever version they > are on) in order to get maximum back compatibility. What happens when two libraries loaded in the same VM have Lucene as a dependency and set actsAsVersion to conflicting numbers? Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org