Re: Global field semantics
Chris Hostetter wrote on 07/10/2006 12:31 PM:
> So i guess we are on the same page that this kind of thing can be done at
> the App level -- what benefits do you see moving them into the Lucene
> index level?

Other than performance per David's and Marvin's ideas, the functionality benefits of having this in the core are probably not compelling. I've been able to hook almost everything and reference a global field model at the application level (except for QueryParser which needs some patches to enhance extensibility for some of these features). It just seemed that a global field model was so useful that it might be a beneficial extension to the core, so I was curious what others thought about this.

Thanks for your thoughts,

Chuck

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Global field semantics
: previously mentioned a very simple one: validating fields in the query
: parser. More interesting examples are:

This strikes me as something that can be done with an abstraction layer above and separate from the physical index (this is in fact what Solr does) without needing to add any hard constraints on the index itself (other than those imposed by the abstraction layer).

: 1. Multiple inheritance on the fields of documents that record the
: sources of each inherited value to support efficient incremental maintenance

I'm sorry, you completely lost me ... can you clarify what you mean?

: 2. "Record-valued fields" that store facets with values (e.g., time
: and user information for who set that value). These cannot easily be
: broken into multiple fields because the fields in question are multi-valued.
: 3. "Join fields" that reference id's of objects stored in separate
: indices (supporting queries that reference the fields in the joined index)

Both of these cases sound like situations where what you really want is more flexibility in the Fields/Terms that can be associated with a docId -- in the case of your "Record-valued fields" you want what I can only think of as "rich terms", hierarchical data that can be queried ... along the lines of the "FlexibleIndexing" wiki page, correct? ... this doesn't seem like it would require more concrete Field rules, but i can certainly see how an added level of abstraction might help.

: Managing these kinds of rich semantic features in query parsing and
: indexing is greatly facilitated by a global field model. I've built
: this into my app, and then started thinking about benefits in Lucene
: generally from such a model.
...

So i guess we are on the same page that this kind of thing can be done at the App level -- what benefits do you see moving them into the Lucene index level? (I imagine it making the most sense as a contrib-ish auxiliary API that developers can use when they don't need the full flexibility the low level API allows ... but it sounds like you think there are functional benefits to it being a first order concept in the Lucene API?)

: Yes. Here is (an elaboration of) the "global model with exceptions"
: idea we reached:

if there can be exceptions then there can't be any hard constraints in the data store, correct? ... so an implementation like this could be a higher level API?

: > docA.add(new Field(f, "bar", Store.YES, Index.UN_TOKENIZED));
: > docA.add(new Field(f, "foo", Store.NO, Index.TOKENIZED));
: >
: > docB.add(new Field(f, "x y", Store.YES, Index.TOKENIZED));
: > docB.add(new Field(f, "z", Store.NO, Index.UN_TOKENIZED));

: Hoss, do you have a use case requiring Store and Index variance like this?

Not to that extreme, but i have certainly encountered situations where storing a single value while indexing multiple values was needed -- this is something Solr's schema can't handle actually, and we had to work around it by using two fields. I've also seen situations where it would make a lot of sense to not only do that with one doc, but to also index a single value and store multiple values in a different doc.

-Hoss
Re: Global field semantics
On 7/11/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 7/10/06, David Balmain <[EMAIL PROTECTED]> wrote:
> > I don't think declaring all fields up front is necessary for
> > substantial optimizations. I've found that the key to some really good
> > optimizations is having constant field numbers. That is, once a field
> > is added to the index it is assigned a field number and it keeps
> > that field number for the life of the index.
>
> I can sort of see how this would work when adding documents to a single index. What about merging indices via IndexWriter.addIndexes()? I guess this would require keeping the current way of merging around as a fallback?

That's right. I still need to work on this. Currently you need to spec each index beforehand to make sure they have the same fields. But it's just a matter of using the old merge model for adding heterogeneous indexes.

> Does this mess up opening a MultiReader on multiple indices constructed at different times? This is a common thing for people to do.

Same as above. I still need to fix this. I have yet to release all these new changes.

> > This allows one
> > FieldInfos object per index instead of one per segment.
>
> So when a new segment is written, the global FieldInfos may need to be updated. I guess this should be written after the new segment and before the "segments" file.

That's exactly how I do it. I did consider putting it all in the "segments" file but I decided not to. I can't remember why right now. So I have a "segments" file and a "fields" file, the "segments" file being written last.

> > As I mentioned
> > earlier this greatly optimizes the merging of term vectors and stored
> > fields. The only problem I could find with this solution is that
> > fields are no longer in alphabetical order in the term dictionary but
> > I couldn't think of a use-case where this is necessary although I'm
> > sure there probably is one.
>
> Isn't an ordered term dictionary necessary to do lookups?

Terms are alphabetically sorted, just not the fields. So if you add a "title" field and then a "content" field they'd have the numbers 0 and 1 respectively. Now if the "title" field has the terms "alpha" and "bravo" and the "content" field has the terms "apple" and "banana" then they'd be ordered like this:

    0:alpha
    0:bravo
    1:apple
    1:banana

instead of like this:

    content:apple
    content:banana
    title:alpha
    title:bravo

Notice that the terms are correctly ordered within each field in both cases; only the fields themselves are no longer in alphabetical order in the first.

Dave
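The two orderings Dave describes can be reproduced with a small, self-contained sketch. This is not Ferret or Lucene code: the `Term` class, the comparator names, and the `render` helper are hypothetical, and the field numbers 0 and 1 are simply the ones from his example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TermOrderSketch {
    /** Hypothetical term: a field name, the number assigned to that field, and the term text. */
    static final class Term {
        final String field;
        final int fieldNumber;
        final String text;

        Term(String field, int fieldNumber, String text) {
            this.field = field;
            this.fieldNumber = fieldNumber;
            this.text = text;
        }
    }

    /** Order by field number first, then alphabetically by term text. */
    static final Comparator<Term> BY_FIELD_NUMBER = (a, b) -> {
        int c = Integer.compare(a.fieldNumber, b.fieldNumber);
        return c != 0 ? c : a.text.compareTo(b.text);
    };

    /** Order by field name first, then alphabetically by term text. */
    static final Comparator<Term> BY_FIELD_NAME = (a, b) -> {
        int c = a.field.compareTo(b.field);
        return c != 0 ? c : a.text.compareTo(b.text);
    };

    /** The four terms from the example, deliberately unsorted. */
    static List<Term> example() {
        return Arrays.asList(
            new Term("title", 0, "bravo"),
            new Term("content", 1, "banana"),
            new Term("title", 0, "alpha"),
            new Term("content", 1, "apple"));
    }

    /** Sorts a copy of the terms and renders each as "field:text" or "number:text". */
    static List<String> render(List<Term> terms, Comparator<Term> order, boolean byNumber) {
        List<Term> sorted = new ArrayList<>(terms);
        sorted.sort(order);
        List<String> out = new ArrayList<>();
        for (Term t : sorted) {
            out.add((byNumber ? Integer.toString(t.fieldNumber) : t.field) + ":" + t.text);
        }
        return out;
    }
}
```

Calling `render(example(), BY_FIELD_NUMBER, true)` gives the first listing and `render(example(), BY_FIELD_NAME, false)` gives the second. In both, a lookup can still binary-search, because each ordering is total and consistent.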
Re: Global field semantics
On 7/10/06, David Balmain <[EMAIL PROTECTED]> wrote:
> I don't think declaring all fields up front is necessary for substantial optimizations. I've found that the key to some really good optimizations is having constant field numbers. That is, once a field is added to the index it is assigned a field number and it keeps that field number for the life of the index.

I can sort of see how this would work when adding documents to a single index. What about merging indices via IndexWriter.addIndexes()? I guess this would require keeping the current way of merging around as a fallback?

Does this mess up opening a MultiReader on multiple indices constructed at different times? This is a common thing for people to do.

> This allows one FieldInfos object per index instead of one per segment.

So when a new segment is written, the global FieldInfos may need to be updated. I guess this should be written after the new segment and before the "segments" file.

> As I mentioned earlier this greatly optimizes the merging of term vectors and stored fields. The only problem I could find with this solution is that fields are no longer in alphabetical order in the term dictionary but I couldn't think of a use-case where this is necessary although I'm sure there probably is one.

Isn't an ordered term dictionary necessary to do lookups?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
Re: Global field semantics
On 7/11/06, Chuck Williams <[EMAIL PROTECTED]> wrote:
> David Balmain wrote on 07/10/2006 01:04 AM:
> > The only problem I could find with this solution is that
> > fields are no longer in alphabetical order in the term dictionary but
> > I couldn't think of a use-case where this is necessary although I'm
> > sure there probably is one.
>
> So presumably fields are still contiguous, you keep a pointer to where each field starts, and terms within the field remain in alphabetical order?

Actually yes, that is how I did it, although I'm not sure now that it's the best way. I was hoping that by having a pointer to the start of each field there would be some good performance gains in searching, but that turned out not to be the case. You really only save a couple of iterations in the getIndexOffset method. To make things easier, you can just leave the TermInfosWriter/Reader almost as they are. The only difference is that you store field numbers in the index rather than field names, and when you compare terms while scanning the index, you compare field numbers rather than field names. I don't know if I've described it very well but I hope that makes sense.

Cheers,
Dave

PS. By the way, I don't know if I made this clear, but the 5x speedup I was talking about comes during indexing. The performance improvement as far as search is concerned wasn't what I had hoped. It is a little faster, but the bottleneck really comes from reading the documents from the index. So to alleviate that I've added lazy field loading, which seems to work well. Actually, I've set it up so that I can read excerpts from fields without even loading the whole field, so highlighting is super fast.
Re: Global field semantics
Chris Hostetter wrote on 07/10/2006 02:06 AM:
> As near as i can tell, the large issue can be summarized with the following
> sentiment:
>
>     Performance gains could be realized if Field
>     properties were made fixed and homogeneous for
>     all Documents in an index.

This is certainly a large issue, as David says he has achieved a 5x performance gain.

My interest in global field semantics originally sprang from functionality considerations, not performance considerations. I've got many features that require reasoning about field semantics. I previously mentioned a very simple one: validating fields in the query parser. More interesting examples are:

1. Multiple inheritance on the fields of documents that record the sources of each inherited value to support efficient incremental maintenance
2. "Record-valued fields" that store facets with values (e.g., time and user information for who set that value). These cannot easily be broken into multiple fields because the fields in question are multi-valued.
3. "Join fields" that reference id's of objects stored in separate indices (supporting queries that reference the fields in the joined index)

Managing these kinds of rich semantic features in query parsing and indexing is greatly facilitated by a global field model. I've built this into my app, and then started thinking about benefits in Lucene generally from such a model.

> 1) all Fields and their properties must be predeclared before any
> document is ever added to the index, and any Field not declared is
> illegal.
> 2) a Field springs into existence the first time a Document is added
> with a value for it -- but after that all newly added Documents with
> a value for that field must conform to the Field properties initially
> used.
>
> (have I missed any general approaches?)

Yes. Here is (an elaboration of) the "global model with exceptions" idea we reached:

3) There is a global field model in Lucene that contains the list of all known fields and their "default semantics". The class that contains this model supports a number of implicit and explicit methods to construct and query the model. The model can be evolved. The model is used in many places in Lucene, in some cases according to application-settable properties. E.g.:

a) Creating a Field uses the properties of the model so they need not be specified at each construction. A global model property determines whether or not field properties may be overridden, and whether or not fields may be created that are not in the model (in which case, they are automatically added to the model).

b) The query parser has hooks that affect Query generation based on the model properties of the field (not just for certain special query types like Term's and RangeQuery's). The application can easily provide methods to implement these hooks. This is essential for features like 2&3 above (and beneficial for 1).

> How would something like this work?
>
> docA.add(new Field(f, "bar", Store.YES, Index.UN_TOKENIZED));
> docA.add(new Field(f, "foo", Store.NO, Index.TOKENIZED));
>
> docB.add(new Field(f, "x y", Store.YES, Index.TOKENIZED));
> docB.add(new Field(f, "z", Store.NO, Index.UN_TOKENIZED));

The application could determine whether or not this kind of operation was supported according to the global enforcement properties of the model. If this is needed, the ability to have exceptions at the Field level would permit it. Hoss, do you have a use case requiring Store and Index variance like this?

The impact of this flexibility on David's 5x is another question...

Chuck
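A minimal sketch of approach 3(a), assuming a model that only tracks Store and Index settings. Nothing here is an existing Lucene class: `FieldModelSketch`, `Semantics`, and the two boolean policy flags are hypothetical names chosen to mirror the "may be overridden" and "may be created if not in the model" properties described above.

```java
import java.util.HashMap;
import java.util.Map;

final class FieldModelSketch {
    enum Store { YES, NO }
    enum Index { NO, TOKENIZED, UN_TOKENIZED }

    /** Default semantics recorded for one field name. */
    static final class Semantics {
        final Store store;
        final Index index;

        Semantics(Store store, Index index) {
            this.store = store;
            this.index = index;
        }

        boolean sameAs(Semantics other) {
            return store == other.store && index == other.index;
        }
    }

    private final Map<String, Semantics> defaults = new HashMap<>();
    private final boolean allowOverrides;  // may a Field deviate from the defaults?
    private final boolean allowNewFields;  // may unknown fields be added implicitly?

    FieldModelSketch(boolean allowOverrides, boolean allowNewFields) {
        this.allowOverrides = allowOverrides;
        this.allowNewFields = allowNewFields;
    }

    /** Explicitly records (or replaces) the default semantics of a field. */
    void declare(String name, Store store, Index index) {
        defaults.put(name, new Semantics(store, index));
    }

    /** Simple-signature case: the caller gives only the name and gets the defaults. */
    Semantics resolve(String name) {
        Semantics known = defaults.get(name);
        if (known == null) {
            throw new IllegalArgumentException("unknown field: " + name);
        }
        return known;
    }

    /** Explicit case: the caller supplies properties, subject to the model's policy. */
    Semantics resolve(String name, Store store, Index index) {
        Semantics requested = new Semantics(store, index);
        Semantics known = defaults.get(name);
        if (known == null) {
            if (!allowNewFields) {
                throw new IllegalArgumentException("unknown field: " + name);
            }
            defaults.put(name, requested);  // first use becomes the default
            return requested;
        }
        if (!allowOverrides && !known.sameAs(requested)) {
            throw new IllegalArgumentException("override not permitted for field: " + name);
        }
        return requested;  // either matches the default or is a permitted exception
    }
}
```

With both flags false this behaves like Hoss's approach 1; with only `allowNewFields` true it behaves like approach 2; with both true it is the "defaults with exceptions" model, in which the docA/docB example would be accepted.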
Re: Global field semantics
David Balmain wrote on 07/10/2006 01:04 AM:
> The only problem I could find with this solution is that
> fields are no longer in alphabetical order in the term dictionary but
> I couldn't think of a use-case where this is necessary although I'm
> sure there probably is one.

So presumably fields are still contiguous, you keep a pointer to where each field starts, and terms within the field remain in alphabetical order?

Chuck
Re: Global field semantics
: > Are there good reasons this path has not been followed?
:
: Hoss, that's your cue.

I must admit, I haven't been able to fully follow this thread, perhaps it's just because it's late (no, that can't be it ... i started reading it at 3:30 this afternoon and then stopped because it was making my head hurt).

In all honesty, I probably would have skimmed the whole thing without commenting if Marvin hadn't called me out onto the mat -- so I'll do my best to make sense of it.

As near as i can tell, the large issue can be summarized with the following sentiment:

    Performance gains could be realized if Field
    properties were made fixed and homogeneous for
    all Documents in an index.

...I've left this sentiment vague, and i'll ignore the implementation specifics since i don't understand them -- but there seem to be two high level approaches that are involved, which are advocated to varying degrees by varying folks...

1) all Fields and their properties must be predeclared before any document is ever added to the index, and any Field not declared is illegal.
2) a Field springs into existence the first time a Document is added with a value for it -- but after that all newly added Documents with a value for that field must conform to the Field properties initially used.

(have I missed any general approaches?)

The questions (in my mind at least) are:

a) How much performance gain can be realized by these limitations?
b) Would it be possible to implement these limitations in such a way that they are "optional" for people willing to accept the trade off?
c) if (b) is false, then is (a) great enough to warrant changing Lucene anyway? What exactly is sacrificed?

I can't speak to (a) or (b) ... but I'll throw out some examples for (c)

Regarding #1... If Fields must be predeclared, Lucene would lose two of the biggest advantages it has in my opinion:

* The ability to evolve an index. To have an extremely large index, and to add a field to this index that is only used by "new" documents. This is not only useful when the nature of your data changes (TPS Reports didn't use to have a "cover_sheet" field, and now they do) but also when the usage of an existing field changes and you don't want to rebuild from scratch (you've always had an indexed "cover_sheet" field, and now you want it to be stored too .. so you change your index building code, and let it run for a little while, and then go back and reindex the old stuff later)

* the ability to have dynamically named fields. At CNET we have "attributes" for products, those attributes are defined in a database, and the list of valid attributes is different based on the type of product. I don't know what they all are, and that list could change tomorrow -- and i don't want to have to rebuild my index from scratch just because someone decided that laptops need a new attribute called "heat dissipation factor" (note:

Regarding #2... This approach wouldn't necessarily conflict with the dynamically named fields example above, but it would suffer the same "evolving index" problems.

Last but not least is the high level issue of "homogeneous" Fields and Field properties for all documents. As has been pointed out, in many cases this is not that big of a deal, because even if you want heterogeneous documents stored in a single index, you can construct a list of Fields which is the union of the Fields from your heterogeneous Documents and use it -- hopefully no new requirement is added that all Documents must have a value for all fields. But what about complex interactions between multi-valued, stored, indexed fields? How would something like this work?

    docA.add(new Field(f, "bar", Store.YES, Index.UN_TOKENIZED));
    docA.add(new Field(f, "foo", Store.NO, Index.TOKENIZED));

    docB.add(new Field(f, "x y", Store.YES, Index.TOKENIZED));
    docB.add(new Field(f, "z", Store.NO, Index.UN_TOKENIZED));

...both docs have two "Fields" for field name "f", both have a stored value for f, both have some indexed terms for f, both have some tokenized terms and one untokenized term for f ... but do these two docs both conform to the same "Global field semantics"?

-Hoss
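To make the question concrete under approach 2 ("first use wins"), here is a rough sketch of a checker that remembers the properties a field name was first used with and flags any later Field that differs. The class and method names are hypothetical, and only the stored/tokenized flags from the docA/docB example are modelled.

```java
import java.util.HashMap;
import java.util.Map;

final class FirstUseFieldCheck {
    // Field name -> the property combination it was first added with.
    private final Map<String, String> firstSeen = new HashMap<>();

    /**
     * Returns true if this is the first use of the field or the properties
     * match the first use; false if they conflict with the first use.
     */
    boolean accept(String field, boolean stored, boolean tokenized) {
        String props = "stored=" + stored + ",tokenized=" + tokenized;
        String prior = firstSeen.putIfAbsent(field, props);
        return prior == null || prior.equals(props);
    }
}
```

Fed the four Fields above in order, a strict checker like this accepts only the first one (`Store.YES`, `UN_TOKENIZED`) and rejects the other three, so neither docA nor docB would conform. That is one way to read the example: it only works if the model allows per-Field exceptions.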
Re: Global field semantics
On 7/10/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Chuck Williams wrote:
> > Lucene today allows many field properties to vary at the Field level.
> > E.g., the same field name might be tokenized in one Field on a Document
> > while it is untokenized in another Field on the same or different
> > Document.
>
> The rationale for this design was to keep the API simple. I think of it like variable declarations: some languages require them and some don't. I opted to make Lucene fields like dynamically-typed variables. In part, Lucene's popularity is due to the simplicity of its API.

The irony has just struck me that most people are happy with the "dynamically-typed" fields in Java (Lucene) but they didn't go down as well in Ruby (Ferret).

> However, in my uses of Lucene, most documents have the same fields used in the same way, so I don't think I've ever actually taken much advantage of this functionality.
>
> It is nice to be able to add a field to an index by changing the indexing code in a single place, where the field's value is created, and not having to also change the index initialization code. We should try to keep such redundancies out of user code. Thus I would encourage any change in this direction to continue to permit fields to be defined lazily, the first time they are added, rather than requiring all fields to be declared up front.
>
> Are there substantial optimizations that are only possible if all fields are known when the index is initialized?

I don't think declaring all fields up front is necessary for substantial optimizations. I've found that the key to some really good optimizations is having constant field numbers. That is, once a field is added to the index it is assigned a field number and it keeps that field number for the life of the index. This allows one FieldInfos object per index instead of one per segment. As I mentioned earlier this greatly optimizes the merging of term vectors and stored fields. The only problem I could find with this solution is that fields are no longer in alphabetical order in the term dictionary but I couldn't think of a use-case where this is necessary although I'm sure there probably is one.

Anyway, hopefully we'll be able to lead the way with some brilliant new ideas in the Lucy project. Put our money where our mouth is, so to speak. If only I had a little more time right now.

Cheers,
Dave
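The "constant field numbers" idea can be sketched as an index-wide registry that hands out a number the first time a field name is seen and never changes it afterwards. This is only an illustration of the principle, not Ferret's implementation; `GlobalFieldNumbers` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GlobalFieldNumbers {
    private final Map<String, Integer> numberByName = new HashMap<>();
    private final List<String> nameByNumber = new ArrayList<>();

    /** Returns the field's existing number, or assigns the next free one on first use. */
    int numberFor(String fieldName) {
        Integer number = numberByName.get(fieldName);
        if (number == null) {
            number = nameByNumber.size();  // numbers are handed out in order of first use
            numberByName.put(fieldName, number);
            nameByNumber.add(fieldName);
        }
        return number;
    }

    /** Reverse lookup from a field number back to its name. */
    String nameOf(int number) {
        return nameByNumber.get(number);
    }

    /** How many distinct fields the index has seen so far. */
    int size() {
        return nameByNumber.size();
    }
}
```

Because every segment written against the same registry uses the same numbers, data that is keyed by field number (stored fields, term vectors) would not need to be renumbered when segments are merged, which is presumably where the merge speedup comes from. Fields are still defined lazily, so nothing has to be declared up front.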
Re: Global field semantics
Chuck Williams wrote:
> Lucene today allows many field properties to vary at the Field level.
> E.g., the same field name might be tokenized in one Field on a Document
> while it is untokenized in another Field on the same or different
> Document.

The rationale for this design was to keep the API simple. I think of it like variable declarations: some languages require them and some don't. I opted to make Lucene fields like dynamically-typed variables. In part, Lucene's popularity is due to the simplicity of its API.

However, in my uses of Lucene, most documents have the same fields used in the same way, so I don't think I've ever actually taken much advantage of this functionality.

It is nice to be able to add a field to an index by changing the indexing code in a single place, where the field's value is created, and not having to also change the index initialization code. We should try to keep such redundancies out of user code. Thus I would encourage any change in this direction to continue to permit fields to be defined lazily, the first time they are added, rather than requiring all fields to be declared up front.

Are there substantial optimizations that are only possible if all fields are known when the index is initialized?

Doug
Re: Global field semantics
On 7/10/06, Chuck Williams <[EMAIL PROTECTED]> wrote:
> David Balmain wrote on 07/09/2006 06:44 PM:
> > On 7/10/06, Chuck Williams <[EMAIL PROTECTED]> wrote:
> >> Marvin Humphrey wrote on 07/08/2006 11:13 PM:
> >> >
> >> > On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
> >> >
> >> >> Many things would be cleaner in Lucene if fields had a global semantics,
> >> >> i.e., if properties like text vs. binary, Index, Store, TermVector, the
> >> >> appropriate Analyzer, the assignment of Directory in ParallelReader (or
> >> >> ParallelWriter), etc. were a function of just the field name and the
> >> >> index.
> >> >
> >> > In June, Dave Balmain and I discussed the issue extensively on the
> >> > Ferret list. It might have been nice to use the Lucy list, since a
> >> > lot of the discussion was about Lucy, but the Lucy lists didn't exist
> >> > at the time.
> >> >
> >> > http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
> >> >
> >> I think there are a number of problems with that proposal and hope it
> >> was not adopted.
> >
> > Hi Chuck,
> >
> > Actually, it was adopted and I'm quite happy with the solution. I'd be
> > very interested to hear what the number of problems are, besides the
> > example you've already given. Even if you never use Ferret, it can
> > only help me improve my software.
>
> Hi David,
>
> Thanks for your reply. I'm not aware of other problems beyond the ones I've already cited. After thinking of these, my confidence that there were not others waned.
>
> > I'll start by covering your term-vector example. By adding fixed
> > index-wide field properties to Ferret I was able to obtain a
> > huge speed improvement during indexing.
>
> This is very interesting. Can you say how much?

About a factor of 5. I won't compare it to Lucene's speed though, as I know that's asking for trouble. You'll be able to try it yourself in a week or so when I finally release it.

> > With the CPU time I gain in Ferret I could
> > easily re-analyze large fields and build term vectors for them
> > separately. It's a little more work for less common use cases like
> > yours but in the end, everyone benefits in terms of performance.
>
> Does Ferret work this way, or would that be up to the application?

Currently that would be up to the application.

> >> As my earlier example showed, there is at least one
> >> valid use case where storing a term vector is not an invariant property
> >> of a field; specifically, when using term vectors to optimize excerpt
> >> generation, it is best to store them only for fields that have long
> >> values. This is even a counter-example to Karl's proposal, since a
> >> single Document may have multiple fields of the same name, some with
> >> long values and others with short values; multiple fields of the same
> >> name may legitimately have different TermVector settings even on a
> >> single Document.
> >
> > I think you'll find if you look at the DocumentWriter#writePostings
> > method that it's "one in, all in" in terms of storing term vectors for
> > a field. That is, if you have 5 "content" fields and only one of those
> > is set to store term vectors, then all of the fields will store term
> > vectors.
>
> Right you are, and clearly necessarily so since the values of the multiple fields are implicitly concatenated (with positionIncrementGap). So, Lucene already limits my term vector optimization to the Document level. As it happens, I only use it for large body fields, of which each of my Documents has at most one.
>
> >> I haven't thought of cases where Index or Store would legitimately vary
> >> across Fields or Documents, but am less convinced there aren't important
> >> use cases for these as well. Similarly, although it is important to
> >> allow term vectors to be on or off at the field level, I don't see any
> >> obvious need to vary the type of term vector (positions, offsets or
> >> both).
> >
> > I think Store could definitely legitimately vary across Fields or
> > Documents for the same reason your term vectors do. Perhaps you are
> > indexing pages from the web and you want to cache only the smaller
> > pages.
>
> That's an interesting example, but not as compelling an objection to me (and seemingly not to you either!). The app could always store an empty string without much consequence in this scenario.
>
> >> There are significant benefits to global semantics, as evidenced by the
> >> fact that several of us independently came to desire this. However,
> >> deciding what can be global and what cannot is more subtle.
> >
> > I agree. I can't se
Re: Global field semantics
David Balmain wrote on 07/09/2006 06:44 PM:
> On 7/10/06, Chuck Williams <[EMAIL PROTECTED]> wrote:
>> Marvin Humphrey wrote on 07/08/2006 11:13 PM:
>> >
>> > On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
>> >
>> >> Many things would be cleaner in Lucene if fields had a global semantics,
>> >> i.e., if properties like text vs. binary, Index, Store, TermVector, the
>> >> appropriate Analyzer, the assignment of Directory in ParallelReader (or
>> >> ParallelWriter), etc. were a function of just the field name and the
>> >> index.
>> >
>> > In June, Dave Balmain and I discussed the issue extensively on the
>> > Ferret list. It might have been nice to use the Lucy list, since a
>> > lot of the discussion was about Lucy, but the Lucy lists didn't exist
>> > at the time.
>> >
>> > http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
>> >
>> I think there are a number of problems with that proposal and hope it
>> was not adopted.
>
> Hi Chuck,
>
> Actually, it was adopted and I'm quite happy with the solution. I'd be
> very interested to hear what the number of problems are, besides the
> example you've already given. Even if you never use Ferret, it can
> only help me improve my software.

Hi David,

Thanks for your reply. I'm not aware of other problems beyond the ones I've already cited. After thinking of these, my confidence that there were not others waned.

> I'll start by covering your term-vector example. By adding fixed
> index-wide field properties to Ferret I was able to obtain a
> huge speed improvement during indexing.

This is very interesting. Can you say how much?

> With the CPU time I gain in Ferret I could
> easily re-analyze large fields and build term vectors for them
> separately. It's a little more work for less common use cases like
> yours but in the end, everyone benefits in terms of performance.

Does Ferret work this way, or would that be up to the application?

>> As my earlier example showed, there is at least one
>> valid use case where storing a term vector is not an invariant property
>> of a field; specifically, when using term vectors to optimize excerpt
>> generation, it is best to store them only for fields that have long
>> values. This is even a counter-example to Karl's proposal, since a
>> single Document may have multiple fields of the same name, some with
>> long values and others with short values; multiple fields of the same
>> name may legitimately have different TermVector settings even on a
>> single Document.
>
> I think you'll find if you look at the DocumentWriter#writePostings
> method that it's "one in, all in" in terms of storing term vectors for
> a field. That is, if you have 5 "content" fields and only one of those
> is set to store term vectors, then all of the fields will store term
> vectors.

Right you are, and clearly necessarily so since the values of the multiple fields are implicitly concatenated (with positionIncrementGap). So, Lucene already limits my term vector optimization to the Document level. As it happens, I only use it for large body fields, of which each of my Documents has at most one.

>> I haven't thought of cases where Index or Store would legitimately vary
>> across Fields or Documents, but am less convinced there aren't important
>> use cases for these as well. Similarly, although it is important to
>> allow term vectors to be on or off at the field level, I don't see any
>> obvious need to vary the type of term vector (positions, offsets or
>> both).
>
> I think Store could definitely legitimately vary across Fields or
> Documents for the same reason your term vectors do. Perhaps you are
> indexing pages from the web and you want to cache only the smaller
> pages.

That's an interesting example, but not as compelling an objection to me (and seemingly not to you either!). The app could always store an empty string without much consequence in this scenario.
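The "one in, all in" behaviour discussed above can be pictured as OR-ing the term-vector flag across all same-named Field instances on a Document. The sketch below illustrates only that rule; it does not reproduce the DocumentWriter code, and `TermVectorFlagSketch`, `FieldInstance`, and `mergeFlags` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TermVectorFlagSketch {
    /** One Field instance as the application supplied it: a name and its term-vector flag. */
    static final class FieldInstance {
        final String name;
        final boolean storeTermVector;

        FieldInstance(String name, boolean storeTermVector) {
            this.name = name;
            this.storeTermVector = storeTermVector;
        }
    }

    /**
     * Collapses per-instance flags into one flag per field name:
     * true as soon as any instance of that name asked for term vectors.
     */
    static Map<String, Boolean> mergeFlags(FieldInstance... fields) {
        Map<String, Boolean> merged = new LinkedHashMap<>();
        for (FieldInstance f : fields) {
            merged.merge(f.name, f.storeTermVector, Boolean::logicalOr);
        }
        return merged;
    }
}
```

With five "content" instances of which only one sets the flag, `mergeFlags` reports `true` for "content", which matches the outcome described: the setting is effectively per field name per Document, not per Field instance.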
>> There are significant benefits to global semantics, as evidenced by the
>> fact that several of us independently came to desire this. However,
>> deciding what can be global and what cannot is more subtle.
>
> I agree. I can't see global field semantics making it into Lucene in
> the short term. It's a rather large change, particularly if you want
> to make full use of the performance benefits it affords.

Could you summarize where these derive from?

>> Perhaps the best thing at the Lucene level is to
Re: Global field semantics
On 7/10/06, Chuck Williams <[EMAIL PROTECTED]> wrote:
> Marvin Humphrey wrote on 07/08/2006 11:13 PM:
>> On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
>>> Many things would be cleaner in Lucene if fields had a global semantics,
>>> i.e., if properties like text vs. binary, Index, Store, TermVector, the
>>> appropriate Analyzer, the assignment of Directory in ParallelReader (or
>>> ParallelWriter), etc. were a function of just the field name and the
>>> index.
>>
>> In June, Dave Balmain and I discussed the issue extensively on the
>> Ferret list. It might have been nice to use the Lucy list, since a
>> lot of the discussion was about Lucy, but the Lucy lists didn't exist
>> at the time.
>>
>> http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
>
> I think there are a number of problems with that proposal and hope it
> was not adopted.

Hi Chuck,

Actually, it was adopted and I'm quite happy with the solution. I'd be
very interested to hear what those problems are, besides the example
you've already given. Even if you never use Ferret, it can only help me
improve my software.

I'll start by covering your term-vector example. By adding fixed
index-wide field properties to Ferret I was able to obtain a huge speed
improvement during indexing. I believe Marvin has had similar success
using his own merge model and with fixed field properties in KinoSearch.
With the CPU time I gain in Ferret I could easily re-analyze large fields
and build term vectors for them separately. It's a little more work for
less common use cases like yours, but in the end everyone benefits in
terms of performance.

> As my earlier example showed, there is at least one valid use case
> where storing a term vector is not an invariant property of a field;
> specifically, when using term vectors to optimize excerpt generation,
> it is best to store them only for fields that have long values. This
> is even a counter-example to Karl's proposal, since a single Document
> may have multiple fields of the same name, some with long values and
> others with short values; multiple fields of the same name may
> legitimately have different TermVector settings even on a single
> Document.

I think you'll find if you look at the DocumentWriter#writePostings
method that it's "one in, all in" in terms of storing term vectors for a
field. That is, if you have 5 "content" fields and only one of those is
set to store term vectors, then all of the fields will store term
vectors.

> As another counter-example from my own app which I'd forgotten
> yesterday, an important case where the Analyzer will vary across
> documents is for i18n, where different languages require different
> analyzers. Refuting again my own argument about this not being
> consistent with query parsing, the language of the query is a distinct
> property from the languages of various documents in the collection.
> In my app, I let the user specify the language of the query, while the
> language of each Document is determined automatically. So, analyzers
> vary for both queries and documents, but independently.

Ferret doesn't record any details about analysis in the field
properties. I definitely agree with you here.

> I haven't thought of cases where Index or Store would legitimately
> vary across Fields or Documents, but am less convinced there aren't
> important use cases for these as well. Similarly, although it is
> important to allow term vectors to be on or off at the field level, I
> don't see any obvious need to vary the type of term vector (positions,
> offsets or both).

I think Store could definitely legitimately vary across Fields or
Documents for the same reason your term vectors do. Perhaps you are
indexing pages from the web and you want to cache only the smaller
pages.

> There are significant benefits to global semantics, as evidenced by
> the fact that several of us independently came to desire this.
> However, deciding what can be global and what cannot is more subtle.

I agree. I can't see global field semantics making it into Lucene in the
short term. It's a rather large change, particularly if you want to make
full use of the performance benefits it affords.

> Perhaps the best thing at the Lucene level is to have a notion of
> default semantics for a field name. Whenever a Field of that name is
> constructed, those semantics would be used unless the constructor
> overrides them. This would allow additional constructors on Field
> with simpler signatures for the common case of invariant Field
> properties. It would also allow applications to access the class that
> holds the default field information for an index. The application
> will know which properties it can rely on as invariant and whether or
> not the set of fields is closed. This approach would preserve upward
> compatibility and provide, I believe, most of the benefits we
Re: Global field semantics
On Jul 9, 2006, at 11:31 AM, Chuck Williams wrote:

> Marvin Humphrey wrote on 07/08/2006 11:13 PM:
>> On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
>>> Many things would be cleaner in Lucene if fields had a global semantics,
>>> i.e., if properties like text vs. binary, Index, Store, TermVector, the
>>> appropriate Analyzer, the assignment of Directory in ParallelReader (or
>>> ParallelWriter), etc. were a function of just the field name and the
>>> index.
>>
>> In June, Dave Balmain and I discussed the issue extensively on the
>> Ferret list. It might have been nice to use the Lucy list, since a
>> lot of the discussion was about Lucy, but the Lucy lists didn't exist
>> at the time.
>>
>> http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
>
> I think there are a number of problems with that proposal and hope it
> was not adopted.

The email which kicks off the thread is Dave's initial proposal for
*Ferret*. That's outside the domain of Apache Lucene. Dave did not
submit it as a proposal for either Lucy or Lucene.

There's an extended discussion which follows where a number of ideas are
kicked around, some of them related to Lucy. Since you respond only to
that one email, perhaps you did not read the rest of the thread.

You asked for prior discussion, and I gave you a link to prior
discussion. Here is the quote from my original email, with the parts
which you silently snipped restored:

>> Has this been considered before?
>
> Robert Kirchgessner made some of the same arguments in a January
> thread. They were compelling then, and they're compelling now.
>
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200601.mbox/[EMAIL PROTECTED]
>
> In June, Dave Balmain and I discussed the issue extensively on the
> Ferret list. It might have been nice to use the Lucy list, since a
> lot of the discussion was about Lucy, but the Lucy lists didn't exist
> at the time.
>
> http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html

I did not intend to submit Dave's Ferret proposal by proxy to this
group. I don't have time right now to defend at length something which
was never meant for either Lucene or Lucy. I know that Dave doesn't
either. I regret having provided the link.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Global field semantics
Marvin Humphrey wrote on 07/08/2006 11:13 PM:
>
> On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
>
>> Many things would be cleaner in Lucene if fields had a global semantics,
>> i.e., if properties like text vs. binary, Index, Store, TermVector, the
>> appropriate Analyzer, the assignment of Directory in ParallelReader (or
>> ParallelWriter), etc. were a function of just the field name and the
>> index.
>
> In June, Dave Balmain and I discussed the issue extensively on the
> Ferret list. It might have been nice to use the Lucy list, since a
> lot of the discussion was about Lucy, but the Lucy lists didn't exist
> at the time.
>
> http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
>

I think there are a number of problems with that proposal and hope it
was not adopted.

As my earlier example showed, there is at least one valid use case where
storing a term vector is not an invariant property of a field;
specifically, when using term vectors to optimize excerpt generation, it
is best to store them only for fields that have long values. This is
even a counter-example to Karl's proposal, since a single Document may
have multiple fields of the same name, some with long values and others
with short values; multiple fields of the same name may legitimately
have different TermVector settings even on a single Document.

As another counter-example from my own app which I'd forgotten
yesterday, an important case where the Analyzer will vary across
documents is for i18n, where different languages require different
analyzers. Refuting again my own argument about this not being
consistent with query parsing, the language of the query is a distinct
property from the languages of various documents in the collection. In
my app, I let the user specify the language of the query, while the
language of each Document is determined automatically. So, analyzers
vary for both queries and documents, but independently.

I haven't thought of cases where Index or Store would legitimately vary
across Fields or Documents, but am less convinced there aren't important
use cases for these as well. Similarly, although it is important to
allow term vectors to be on or off at the field level, I don't see any
obvious need to vary the type of term vector (positions, offsets or
both).

There are significant benefits to global semantics, as evidenced by the
fact that several of us independently came to desire this. However,
deciding what can be global and what cannot is more subtle.

Perhaps the best thing at the Lucene level is to have a notion of
default semantics for a field name. Whenever a Field of that name is
constructed, those semantics would be used unless the constructor
overrides them. This would allow additional constructors on Field with
simpler signatures for the common case of invariant Field properties.
It would also allow applications to access the class that holds the
default field information for an index. The application will know which
properties it can rely on as invariant and whether or not the set of
fields is closed. This approach would preserve upward compatibility and
provide, I believe, most of the benefits we all seek.

Thoughts?

Chuck
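A minimal sketch of the "default semantics for a field name" idea in the message above, in plain Java with no Lucene dependency. Every name here (StoreMode, IndexMode, VectorMode, FieldSemantics, IndexFieldDefaults, SketchField) is hypothetical and invented for illustration; none of it is Lucene API, and the real Field class would need different plumbing.

```java
import java.util.HashMap;
import java.util.Map;

// All names here are hypothetical; none of these classes exist in Lucene.
enum StoreMode { YES, NO }
enum IndexMode { NO, TOKENIZED, UN_TOKENIZED }
enum VectorMode { NO, YES }

/** Immutable bundle of the properties that might default per field name. */
final class FieldSemantics {
    final StoreMode store;
    final IndexMode index;
    final VectorMode termVector;

    FieldSemantics(StoreMode store, IndexMode index, VectorMode termVector) {
        this.store = store;
        this.index = index;
        this.termVector = termVector;
    }

    /** Copy with a different term-vector setting, for per-Field overrides. */
    FieldSemantics withTermVector(VectorMode tv) {
        return new FieldSemantics(store, index, tv);
    }
}

/** Holds the default semantics for each field name of one index. */
final class IndexFieldDefaults {
    private final Map<String, FieldSemantics> byName = new HashMap<>();

    void define(String name, FieldSemantics semantics) {
        byName.put(name, semantics);
    }

    FieldSemantics require(String name) {
        FieldSemantics s = byName.get(name);
        if (s == null) {
            throw new IllegalArgumentException("no default semantics for field: " + name);
        }
        return s;
    }
}

final class SketchField {
    final String name;
    final String value;
    final FieldSemantics semantics;

    /** Simple constructor: use the index-wide defaults for this name. */
    SketchField(IndexFieldDefaults defaults, String name, String value) {
        this(name, value, defaults.require(name));
    }

    /** Overriding constructor: explicit semantics win over the defaults. */
    SketchField(String name, String value, FieldSemantics semantics) {
        this.name = name;
        this.value = value;
        this.semantics = semantics;
    }
}
```

With this shape, the excerpt use case would construct most fields through the simple constructor and pass `defaults.require(name).withTermVector(VectorMode.YES)` to the overriding one only for long values. Whether an undefined name should throw (a closed field set) or fall back to some built-in default is a design choice the proposal leaves open; the sketch throws.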
Re: Global field semantics
On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:

> Many things would be cleaner in Lucene if fields had a global
> semantics, i.e., if properties like text vs. binary, Index, Store,
> TermVector, the appropriate Analyzer, the assignment of Directory in
> ParallelReader (or ParallelWriter), etc. were a function of just the
> field name and the index.

This is the direction I would like to go.

> This approach would naturally admit a class, say IndexFieldSet, that
> would hold global field semantics for an index.
>
> Lucene today allows many field properties to vary at the Field level.
> E.g., the same field name might be tokenized in one Field on a
> Document while it is untokenized in another Field on the same or
> different Document. Does anybody know how often this flexibility is
> used? Are there interesting use cases for which it is important? It
> seems to me this functionality is already problematic and not fully
> supported; e.g., indexing can manage tokenization-variant fields, but
> query parsing cannot. Various extensions to Lucene exacerbate this
> kind of problem.
>
> Perhaps more controversially, the notion of global field semantics
> would be even stronger if the set of fields is closed. This would
> allow, for example, QueryParser to validate field names. This has a
> number of benefits, including for example avoiding false-negative "no
> results" due to misspelling a field name.
>
> Has this been considered before?

Robert Kirchgessner made some of the same arguments in a January thread.
They were compelling then, and they're compelling now.

http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200601.mbox/% [EMAIL PROTECTED]

In June, Dave Balmain and I discussed the issue extensively on the
Ferret list. It might have been nice to use the Lucy list, since a lot
of the discussion was about Lucy, but the Lucy lists didn't exist at the
time.

http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html

Thoughts on the document storage that occurred to me after that
discussion: maybe the fdx file should spec two numbers: a file pointer,
and an integer which indicates the class of object stored at that
position in the fdt file. The registry which maps integers to classes
could be stored in some centralized file. Perhaps one of these classes
-- a LazyDoc -- could specify that only a few integer file pointers
should be read right away, deferring reading of field data until later.

> Are there good reasons this path has not been followed?

Hoss, that's your cue.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
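The fdx idea in the message above (a file pointer plus an integer naming the class of the stored object) can be sketched as follows. This is an illustration under assumptions, not Lucene's or Lucy's file format: the fixed 12-byte entry layout, the in-memory ByteBuffer standing in for the file, and the names StoredClassRegistry and FdxEntry are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical registry mapping small integers to stored-object class names. */
final class StoredClassRegistry {
    private final List<String> names = new ArrayList<>();

    /** Returns the existing id for a class name, or assigns the next free one. */
    int register(String className) {
        int existing = names.indexOf(className);
        if (existing >= 0) {
            return existing;
        }
        names.add(className);
        return names.size() - 1;
    }

    String nameOf(int classId) {
        return names.get(classId);
    }
}

/** Hypothetical fixed-width fdx entry: fdt pointer (long) + class id (int). */
final class FdxEntry {
    static final int BYTES = Long.BYTES + Integer.BYTES;

    final long fdtPointer;
    final int classId;

    FdxEntry(long fdtPointer, int classId) {
        this.fdtPointer = fdtPointer;
        this.classId = classId;
    }

    /** Appends this entry at the buffer's current position. */
    void writeTo(ByteBuffer fdx) {
        fdx.putLong(fdtPointer);
        fdx.putInt(classId);
    }

    /** Reads the entry for docId; fixed width makes this a direct seek. */
    static FdxEntry lookup(ByteBuffer fdx, int docId) {
        ByteBuffer view = fdx.duplicate(); // independent position, shared bytes
        view.position(docId * BYTES);
        return new FdxEntry(view.getLong(), view.getInt());
    }
}
```

A reader would look up the entry for a document, resolve `classId` through the registry, and hand the `fdtPointer` to whatever decoder that class names; a LazyDoc decoder could then read only a few pointers up front and defer the field data.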
Re: Global field semantics
karl wettin wrote on 07/08/2006 12:27 PM:
> On Sat, 2006-07-08 at 11:08 -0700, Chuck Williams wrote:
>
>> Karl, do you have specific reasons or use cases to normalize fields at
>> Document rather than at Index?
>>
>
> Nothing more than that the way the API looks, it implies features that
> do not exist. Boost, store, index and vectors. I've learned, but I'm
> certain lots of newbies make the same assumptions as I did.
>

I forgot one of my own use cases! My app uses term vectors as an
optimization for determining excerpts (aka summaries). Term vectors
increase the index size. For large documents, the performance benefits
of using term vectors to find excerpts are large, but for small
documents they are non-existent or negative. So, to optimize
performance and minimize index size, I store term vectors on the
relevant fields only when their values are sufficiently large.

This is a concrete example of using the same field name with different
Field.TermVector values on different Documents. Are there any similar
examples for Field.Index or Field.Store?

Chuck
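The size-dependent term vector choice described above might look roughly like this. It is only a sketch: the message does not say what "sufficiently large" means, so the threshold is a caller-supplied assumption, and TermVectorChoice and ExcerptVectorPolicy are hypothetical names standing in for whatever the app maps onto Field.TermVector.

```java
/** Hypothetical stand-in for the term vector setting the app would pass to Lucene. */
enum TermVectorChoice { NO, WITH_POSITIONS_OFFSETS }

/** Decides per value, not per field name, whether to store a term vector. */
final class ExcerptVectorPolicy {
    private final int minCharsForVector;

    /** minCharsForVector is an assumed tuning knob; the thread gives no number. */
    ExcerptVectorPolicy(int minCharsForVector) {
        if (minCharsForVector < 0) {
            throw new IllegalArgumentException("threshold must be non-negative");
        }
        this.minCharsForVector = minCharsForVector;
    }

    /** Long values get a term vector for fast excerpts; short ones skip the index cost. */
    TermVectorChoice choose(String fieldValue) {
        return fieldValue.length() >= minCharsForVector
                ? TermVectorChoice.WITH_POSITIONS_OFFSETS
                : TermVectorChoice.NO;
    }
}
```

Because the decision depends on the value, two Fields with the same name can end up with different settings, which is exactly why this property resists being fixed index-wide.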
Re: Global field semantics
On Sat, 2006-07-08 at 11:08 -0700, Chuck Williams wrote:
>
> Karl, do you have specific reasons or use cases to normalize fields at
> Document rather than at Index?

Nothing more than that the way the API looks, it implies features that
do not exist. Boost, store, index and vectors. I've learned, but I'm
certain lots of newbies make the same assumptions as I did.
Re: Global field semantics
karl wettin wrote on 07/08/2006 10:27 AM:
> On Sat, 2006-07-08 at 09:46 -0700, Chuck Williams wrote:
>
>> Many things would be cleaner in Lucene if fields had a global semantics,
>>
>
>> Has this been considered before? Are there good reasons this path has
>> not been followed?
>>
>
> I've been posting some advocacy about the current Field. Basically I
> would like to see a more normalized field setting per document (instead
> of normalizing it in the writer), and I've been talking about something
> like this:
>
> [Document]<#>--- {1..*} ->[Value]-->[Field +name +store +index +vector]
>                                        A
>                                        | {0..*}
>                                        |
>                                     [Index]
>

And what I'm after would look like this:

[Document]<#>--- {1..*} ->[Value]
                             A
                             | {*..1}
                             |
      [Field +store +index +vector +analyzer +directory]
                             A
                             | {1..1}
                             |
                        [FieldName]
                             A
                             | {0..*}
                             |
                          [Index]

The key points are to have Index be a first-class object and to have
field names uniquely specify field properties.

Karl, do you have specific reasons or use cases to normalize fields at
Document rather than at Index?

Chuck
Re: Global field semantics
On Sat, 2006-07-08 at 09:46 -0700, Chuck Williams wrote:
> Many things would be cleaner in Lucene if fields had a global semantics,

> Has this been considered before? Are there good reasons this path has
> not been followed?

I've been posting some advocacy about the current Field. Basically I
would like to see a more normalized field setting per document (instead
of normalizing it in the writer), and I've been talking about something
like this:

[Document]<#>--- {1..*} ->[Value]-->[Field +name +store +index +vector]
                                       A
                                       | {0..*}
                                       |
                                    [Index]

I've made lots of changes and added new features like this to my own
branch, as it takes ten times the effort to fix deprecations and
backwards compatibility for these things. I'm so up for a Lucene 3.0
code sandbox.
Global field semantics
Many things would be cleaner in Lucene if fields had a global semantics,
i.e., if properties like text vs. binary, Index, Store, TermVector, the
appropriate Analyzer, the assignment of Directory in ParallelReader (or
ParallelWriter), etc. were a function of just the field name and the
index. This approach would naturally admit a class, say IndexFieldSet,
that would hold global field semantics for an index.

Lucene today allows many field properties to vary at the Field level.
E.g., the same field name might be tokenized in one Field on a Document
while it is untokenized in another Field on the same or different
Document. Does anybody know how often this flexibility is used? Are
there interesting use cases for which it is important? It seems to me
this functionality is already problematic and not fully supported; e.g.,
indexing can manage tokenization-variant fields, but query parsing
cannot. Various extensions to Lucene exacerbate this kind of problem.

Perhaps more controversially, the notion of global field semantics would
be even stronger if the set of fields is closed. This would allow, for
example, QueryParser to validate field names. This has a number of
benefits, including for example avoiding false-negative "no results" due
to misspelling a field name.

Has this been considered before? Are there good reasons this path has
not been followed?

Thanks for any info,

Chuck
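To make the closed-field-set point concrete, here is a small sketch of field-name validation at query time. It is an assumption-laden toy: ClosedFieldSet is a hypothetical class, and the "parser" only splits on whitespace and looks for a `field:` prefix, which is far simpler than Lucene's QueryParser syntax (no phrases, grouping, or escaping).

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical closed set of field names for one index. */
final class ClosedFieldSet {
    private final Set<String> names;

    ClosedFieldSet(Collection<String> names) {
        this.names = Collections.unmodifiableSet(new HashSet<>(names));
    }

    /** Fails loudly instead of letting a misspelled field return no results. */
    void requireKnown(String field) {
        if (!names.contains(field)) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
    }

    /**
     * Checks every "field:term" clause of a whitespace-separated query.
     * Clauses without a colon are treated as default-field terms and skipped.
     */
    void validateQuery(String query) {
        for (String clause : query.trim().split("\\s+")) {
            int colon = clause.indexOf(':');
            if (colon > 0) {
                requireKnown(clause.substring(0, colon));
            }
        }
    }
}
```

With fields `title` and `body` defined, a query such as `titel:lucene` is rejected with "unknown field: titel" rather than silently matching nothing, which is the false-negative case the message describes.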