Re: SOLR-1131 - Multiple Fields per Field Type
On Dec 10, 2009, at 1:01 AM, Mattmann, Chris A (388J) wrote:
Hi Yonik et al., I'd like to add:
Option C: Sub fields are specified as an attribute on the fieldType tag
// needed to essentially define the point type
<fieldType name="latlon" class="GeoPoint" subFieldSuffix="_latlon" ../>
// uses of the latlon type
<field name="home" type="latlon" indexed="true" stored="false"/>
// subFieldSuffix is appended to the subFields indexed and thus those would be:
home_latlon_0
home_latlon_1
I like elements of Option B that you present below, however it seems to be mixing concerns. Type inheritance (aka your subFieldType attribute) seems to be orthogonal to poly fields -- a good idea, but another issue IMHO.

I'm not sure this works, as you need to specify the type of the subfield, which is what Option B does. I don't think inheritance is what is going on here, more like delegation, and that isn't necessarily needed for all implementations, but just happens to be how it is done for the example in question. People implementing FieldTypes could certainly just encode things the way they want using their own internal mechanism (or the existing ones, but w/o configuration).

Cheers, Chris
On 12/9/09 1:12 PM, Yonik Seeley yo...@lucidimagination.com wrote:
Proposal for handling points using only the field lookup mechanisms currently in place in IndexSchema:

Option A: dynamic fields used for subfields, those dynamic fields need to be explicitly defined in the XML
// needed to essentially define the point type
<fieldType name="latlon" class="TrieDoubleField" precisionStep="8"/>
<fieldType name="point" subFieldSuffix="_latlon" .../>
<dynamicField name="*_latlon" type="latlon" indexed="true" stored="false"/>
// uses of the point type
<field name="home" type="point"/>
<dynamicField name="*_point" type="point"/>
// subFieldSuffix is appended to the subFields indexed and thus those would be
home__0_latlon
home__1_latlon
// And the indexed fields for dynamic field work_point would be
work_point__0_latlon
work_point__1_latlon
// NOTE: this scheme works fine for subFields with different fieldTypes

Option B: dynamic fields used for subfields, dynamic fields inserted into schema automatically
// needed to essentially define the point type
<fieldType name="latlon" class="TrieDoubleField" precisionStep="8"/>
<fieldType name="point" subFieldType="latlon"/>
// uses of the point type
<field name="home" type="point"/>
<dynamicField name="*_point" type="point"/>
// A dynamic field is inserted into the schema by the point class of the form __subFieldTypeName by default.
// This could be changed via an optional subFieldSuffix param on the point fieldType. double underscore used
// to minimize collisions with user-defined dynamic fields.
home_0__latlon
home_1__latlon
// And the indexed fields for dynamic field work_point would be
work_point__0__latlon
work_point__1__latlon
// NOTE: this scheme works fine for subFields with different fieldTypes

-Yonik
http://www.lucidimagination.com

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

--
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: SOLR-1131 - Multiple Fields per Field Type
Hi Grant,
On 12/10/09 3:16 AM, Grant Ingersoll gsing...@apache.org wrote:
I'm not sure this works, as you need to specify the type of the subfield, which is what Option B does. I don't think inheritance is what is going on here, more like delegation, and that isn't necessarily needed for all implementations, but just happens to be how it is done for the example in question. People implementing FieldTypes could certainly just encode things the way they want using their own internal mechanism (or the existing ones, but w/o configuration).

Well if that doesn't work then option A doesn't work either because it doesn't specify the subFieldType. I'm also not sure that option B works, because what if there are multiple subFieldTypes? For instance, what if I want to store one of the polyField's subTypes as a tint, and the other as a regular int? How would that be specified? Either way you need some combination of A + B, or (my preference) B + C.

Cheers, Chris
Re: SOLR-1131 - Multiple Fields per Field Type
I'll try and hold off, but also work on a patch for option (B+)C :)
On 12/10/09 7:37 AM, Grant Ingersoll gsing...@apache.org wrote:
I have Option B implemented at this point, minus a few tests passing. I'll put up a patch as soon as I get it working.
Re: SOLR-1131 - Multiple Fields per Field Type
: My current thought on #1 is that we probably don't want to change the internal lookup mechanism used by IndexSchema unless we gain significant power by doing so. I'm not sure I currently see it.
:
: My thoughts on #2 are more on a case-by-case basis. For the simple case of a point class with two fields indexed separately, referencing a suffix that should be defined as a dynamic field vs referencing a type seems pretty close. The latter, while perhaps slightly simpler for the user, seems to introduce a lot of hidden complexities. I'm less concerned than Hoss is about name clashes, but much more concerned about those complexities.

...this was ultimately my point. Name clashes were a concrete example i was trying to use to illustrate the (broader) complexity that seems involved in trying to develop a generalized PolyField system that is 100% transparent to both end users and schema creators.

My personal view is that none of this stuff should be transparent to the schema creator: we shouldn't try to hide things from them -- having defaults so they don't have to worry about some details is fine, but they should have control over those details if they want.

That said: although i have an opinion, it's not a particularly strong opinion. Since i appear to be the minority here (and feel reasonably confident that i've explained my concerns well enough that you guys understand my point even if you don't agree with it) i'll just say i'm +1 at leaving the schema creator entirely in control of the names created by PolyFields; and -0 at having fields created entirely transparently; ... and leave it at that.

-Hoss
Re: SOLR-1131 - Multiple Fields per Field Type
: I'm not really understanding the value of an approach like that. for starters, what Lucene field names would ultimately be created in those examples?
:
: The first field would be named location__location.
: The second field would be named location_home_location_home.
: The third field would be named location_work_location_work.

I'm not understanding your answer -- the whole point of this hypothetical PolyField example is that for each field the user knows about, it's building up a lat and a lon field under the covers -- my question is what real, under the covers, names do you suggest be created from the type of configuration you suggested?

: <field name="other_location" type="latlon"/>
: <dynamicField name="*_dynamic_location" type="latlon"/>
:
: ...then what field names would be created under the covers?
:
: In general, it would be FieldType#getPattern().stripOffEndRegexStarStuff() + Field#getName().

...that still feels a lot more obscure than letting the field declaration control things -- because as a schema creator i not only have to check that the field name i want to add doesn't conflict with an existing field name (or dynamicField name glob) but i also have to check that it doesn't conflict with a pattern in a PolyField fieldType declaration. why make people check two things when using dynamicFields is something they already understand?

: Well if this feels wrong to you then I think the schema.xml file that ships with SOLR should also feel wrong as well because it uses the exact same pattern for defining field type variations. That is, differences between FieldType representations for ints and tints are not stored as variations on the SchemaField definition itself but they are stored as variation on the FieldTypes (e.g., a different precisionStep in the case of int [0] versus that of tint [8]). Based on what you are proposing, why isn't precisionStep an attribute on field, rather than fieldType in those examples?

There's a huge difference there -- nothing in a <fieldType/> declaration right now has any influence whatsoever on the ultimate name of the fields used -- <field/> declarations can inherit a lot of stuff from <fieldType/> but we've never let the <fieldType/> influence the name. In an existing solr schema, i can have a list of <fieldType/>s and i can have a list of <field/>s that reference those <fieldType/>s by name -- and i can tweak the settings on those things largely independently (as long as i reindex) ... but i never have to worry that tweaking the setting of fieldType might completely break an index by causing the underlying name of some <field name="a" .../> to suddenly collide with some other <field name="a__a" .../>

: Possibly. It's also a lot less traceable. It's implicit versus explicit, which I'm not sure leads to simplicity in the end.

I feel the exact opposite actually. Saying "these PolyField types will create multiple fields under the covers, and you have to use them as <dynamicField/>s to control what names they use" seems a lot more explicit and easily traceable than "these PolyField types will create multiple fields under the covers, so you specify a pattern when you declare them, and then when you declare fields or dynamicFields that use them the following rule(s) will be applied to generate the underlying field names, so remember this rule when naming other fields to prevent conflicts".

-Hoss
Re: SOLR-1131 - Multiple Fields per Field Type
: <fieldType name="point" type="solr.PointType" dimension="2" subFieldType="double"/>
: <field name="home" type="point" indexed="true" stored="true"/>
...
: And a new document of:
: <doc>
: <field name="point">39.0 -79.434</field>
: </doc>
:
: There are three fields created:
: home -- Contains the stored value
: home___0 - Contains 39.0 indexed as a double (as in the double FieldType, not just a double precision)
: home___1 - Contains -79.434 as a double

Grant: All of this i understand -- the back and forth Mattmann and I have been having is specifically about the idea that the __0 and __1 should be more transparent when declaring the schema. As it stands right now, if i add this to my schema...

<field name="home___0" type="int" indexed="true" stored="true"/>

...i can really break things. The odds of that happening are probably low, but it would still be very easy to make this type more transparent to schema creators by requiring that PolyFields be declared as dynamicFields. so your previous example would become...

: <fieldType name="point" type="solr.PointType" dimension="2" subFieldType="double"/>
: <dynamicField name="home*" type="point" indexed="true" stored="true"/>

...now if i'm stupid enough to add <field name="home___0"/> it's my own damn fault (just like it is right now w/o having PolyFields in Solr)

: letting <dynamicField/> drive everything just seems a *lot* simpler ... both as far as implementation, and as far as maintaining the schema.
:
: I don't agree. It requires more configuration and more knowledge by the end user and doesn't hide the details.

1) My example requires 8 more characters than yours.
2) The end user doesn't need to know it's a dynamic field, they still just deal with a field named home
3) my whole point is that we shouldn't be hiding these details from the person editing the schema.xml

-Hoss
Re: SOLR-1131 - Multiple Fields per Field Type
: That's not how the Cartesian Field stuff works, but I think I see what you are getting at and I would say I'm going to explicitly punt on that right now. Ultimately, I think when such a case comes up, the FieldType needs to be configured to be able to determine this information.

I'm fine punting on it -- i don't understand half this stuff anyway -- i just wanted to raise the issue in case someone else said "Ack! ... yes that is a big oversight in the API that will cause problems with X, Y, Z."

-Hoss
Re: SOLR-1131 - Multiple Fields per Field Type
On Dec 9, 2009, at 2:04 PM, Chris Hostetter wrote:
: <fieldType name="point" type="solr.PointType" dimension="2" subFieldType="double"/>
: <field name="home" type="point" indexed="true" stored="true"/>
...
: And a new document of:
: <doc>
: <field name="point">39.0 -79.434</field>
: </doc>
:
: There are three fields created:
: home -- Contains the stored value
: home___0 - Contains 39.0 indexed as a double (as in the double FieldType, not just a double precision)
: home___1 - Contains -79.434 as a double

Grant: All of this i understand -- the back and forth Mattmann and I have been having is specifically about the idea that the __0 and __1 should be more transparent when declaring the schema. As it stands right now, if i add this to my schema...

<field name="home___0" type="int" indexed="true" stored="true"/>

...i can really break things. The odds of that happening are probably low, but it would still be very easy to make this type more transparent to schema creators by requiring that PolyFields be declared as dynamicFields. so your previous example would become...

: <fieldType name="point" type="solr.PointType" dimension="2" subFieldType="double"/>
: <dynamicField name="home*" type="point" indexed="true" stored="true"/>

...now if i'm stupid enough to add <field name="home___0"/> it's my own damn fault (just like it is right now w/o having PolyFields in Solr)

: letting <dynamicField/> drive everything just seems a *lot* simpler ... both as far as implementation, and as far as maintaining the schema.
:
: I don't agree. It requires more configuration and more knowledge by the end user and doesn't hide the details.

1) My example requires 8 more characters than yours.

It's not about the characters, obviously, it's about the mindset of the person doing the modeling, hence...

2) The end user doesn't need to know it's a dynamic field, they still just deal with a field named home
3) my whole point is that we shouldn't be hiding these details from the person editing the schema.xml

I'm not sure I agree. I think people would expect to use a new Field Type in exactly the same ways they use existing Field Types, namely anywhere they want (dynamic or not). We could easily validate the schema at start up time to see whether they have done the scenario you describe above and throw an exception.
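The start-up validation suggested here could look roughly like the sketch below. It is not Solr code: the function names, the dictionary-based schema, and the "___" separator (taken from the home___0 example earlier in the thread) are all assumptions made for illustration.

```python
def find_subfield_collisions(fields, poly_types, dimension=2, separator="___"):
    """Return (poly_field, clashing_name) pairs.

    `fields` maps explicitly declared field names to type names.
    `poly_types` is the set of type names that generate subfields.
    The naming rule (name + separator + index) is an assumption based on
    the home___0 / home___1 example in this thread.
    """
    collisions = []
    for name, type_name in fields.items():
        if type_name not in poly_types:
            continue
        for i in range(dimension):
            generated = f"{name}{separator}{i}"
            if generated in fields:
                collisions.append((name, generated))
    return collisions


def validate_schema(fields, poly_types):
    """Fail fast, as at schema load time, if a declared field shadows a subfield."""
    collisions = find_subfield_collisions(fields, poly_types)
    if collisions:
        raise ValueError(f"poly field name collisions: {collisions}")


# Hoss's scenario: an explicit home___0 declared next to a poly field named home.
fields = {"home": "point", "home___0": "int", "title": "text"}
print(find_subfield_collisions(fields, {"point"}))
```

A check like this only covers explicitly declared fields; clashes with dynamicField globs would need a separate pattern comparison.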
Re: SOLR-1131 - Multiple Fields per Field Type
I haven't followed this whole thread... but I wanted to point out that it probably intersects with the review of Grant's latest patch that I did here: https://issues.apache.org/jira/browse/SOLR-1131

I did want to cut'n'paste something from that post:

: I do want to separate these two issues though:
: 1) field lookup mechanism (currently just exact name in schema followed by a dynamic field check)
: 2) if and when fields or field types should be explicitly defined in the schema vs being created by the polyField

My current thought on #1 is that we probably don't want to change the internal lookup mechanism used by IndexSchema unless we gain significant power by doing so. I'm not sure I currently see it.

My thoughts on #2 are more on a case-by-case basis. For the simple case of a point class with two fields indexed separately, referencing a suffix that should be defined as a dynamic field vs referencing a type seems pretty close. The latter, while perhaps slightly simpler for the user, seems to introduce a lot of hidden complexities. I'm less concerned than Hoss is about name clashes, but much more concerned about those complexities.

-Yonik
http://www.lucidimagination.com
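For readers unfamiliar with issue #1, the lookup order described above (exact name first, then a dynamic field check) can be sketched as follows. This is an illustration only, not the IndexSchema implementation: the function name, the data layout, the use of Python's fnmatch for glob matching, and trying patterns in declaration order are all assumptions.

```python
from fnmatch import fnmatchcase


def lookup_field_type(name, fields, dynamic_fields):
    """Resolve a field name to a type name, or None if nothing matches.

    Step 1: exact match against explicitly declared fields.
    Step 2: first matching dynamic field glob, in declaration order
            (the ordering rule is an assumption of this sketch).
    """
    if name in fields:
        return fields[name]
    for pattern, type_name in dynamic_fields:
        if fnmatchcase(name, pattern):
            return type_name
    return None


# Declarations modeled on the examples in this thread.
fields = {"home": "point"}
dynamic_fields = [("*_latlon", "tdouble"), ("*_point", "point")]

for candidate in ("home", "home__0_latlon", "work_point", "work_point__0_latlon"):
    print(candidate, "->", lookup_field_type(candidate, fields, dynamic_fields))
```

With this mechanism unchanged, a poly field only has to emit subfield names that some dynamic field pattern already covers.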
Re: SOLR-1131 - Multiple Fields per Field Type
Hi All,

: <fieldType name="point" type="solr.PointType" dimension="2" subFieldType="double"/>
: <field name="home" type="point" indexed="true" stored="true"/>
...
: And a new document of:
: <doc>
: <field name="point">39.0 -79.434</field>
: </doc>
:
: There are three fields created:
: home -- Contains the stored value
: home___0 - Contains 39.0 indexed as a double (as in the double FieldType, not just a double precision)
: home___1 - Contains -79.434 as a double

Grant: All of this i understand -- the back and forth Mattmann and I have been having is specifically about the idea that the __0 and __1 should be more transparent when declaring the schema. As it stands right now, if i add this to my schema...

<field name="home___0" type="int" indexed="true" stored="true"/>

...i can really break things. The odds of that happening are probably low, but it would still be very easy to make this type more transparent to schema creators by requiring that PolyFields be declared as dynamicFields. so your previous example would become...

: <fieldType name="point" type="solr.PointType" dimension="2" subFieldType="double"/>
: <dynamicField name="home*" type="point" indexed="true" stored="true"/>

...now if i'm stupid enough to add <field name="home___0"/> it's my own damn fault (just like it is right now w/o having PolyFields in Solr)

: letting <dynamicField/> drive everything just seems a *lot* simpler ... both as far as implementation, and as far as maintaining the schema.
:
: I don't agree. It requires more configuration and more knowledge by the end user and doesn't hide the details.

1) My example requires 8 more characters than yours.

It's not about the characters, obviously, it's about the mindset of the person doing the modeling, hence...

+1.

2) The end user doesn't need to know it's a dynamic field, they still just deal with a field named home
3) my whole point is that we shouldn't be hiding these details from the person editing the schema.xml

I'm not sure I agree. I think people would expect to use a new Field Type in exactly the same ways they use existing Field Types, namely anywhere they want (dynamic or not). We could easily validate the schema at start up time to see whether they have done the scenario you describe above and throw an exception.

+1 to that, as well. I had mentioned in an earlier thread about using the APIs and code to perform such a check.

Cheers, Chris
Re: SOLR-1131 - Multiple Fields per Field Type
On Wed, Dec 9, 2009 at 2:41 PM, Yonik Seeley yo...@lucidimagination.com wrote:
So... the question is, do we have a concrete alternative to this that is well fleshed out?

I do, I do... just a little variant that is geo specific and hence results in nicer names :-)

<fieldType name="point" latSuffix="_lat" lonSuffix="_lon"/>
<field name="home" type="point"/>
<dynamicField name="*_lat" type="tdouble" indexed="true" stored="false"/>
<dynamicField name="*_lon" type="tdouble" indexed="true" stored="false"/>
<dynamicField name="*_point" type="point"/>

home_lat
home_lon
work_point_lat
work_point_lon

Note: if you want the double or triple underscore to help prevent collisions... then you could use latSuffix="___lat" and define the dynamic fields that way.

-Yonik
http://www.lucidimagination.com

On Wed, Dec 9, 2009 at 2:41 PM, Yonik Seeley yo...@lucidimagination.com wrote:
Here's an example of how everything could work with dynamic fields (apologies if it overlaps with examples already given by others in this thread):

<fieldType name="point" fieldSuffix="_latlon" .../>
// the subFields for the points end in _latlon
<field name="home" type="point"/>
<dynamicField name="*_latlon" type="tdouble" indexed="true" stored="false"/>
// And we also want to allow point dynamic fields
<dynamicField name="*_point" type="point"/>
// Note: Grant made point more generic than geo, so it's 0 and 1 instead of lat and lon
// OK, so now the indexed fields for home would be
home__0_latlon
home__1_latlon
// And the indexed fields for dynamic field work_point would be
work_point__0_latlon
work_point__1_latlon

Not the prettiest names... but I think everything is well defined (how it would work with subFields of differing types.. have another param specifying a different suffix, how it works with dynamic fields, etc).

-Yonik
http://www.lucidimagination.com
Re: SOLR-1131 - Multiple Fields per Field Type
On Dec 9, 2009, at 2:46 PM, Yonik Seeley wrote:
On Wed, Dec 9, 2009 at 2:41 PM, Yonik Seeley yo...@lucidimagination.com wrote:
So... the question is, do we have a concrete alternative to this that is well fleshed out?

I do, I do... just a little variant that is geo specific and hence results in nicer names :-)

<fieldType name="point" latSuffix="_lat" lonSuffix="_lon"/>
<field name="home" type="point"/>
<dynamicField name="*_lat" type="tdouble" indexed="true" stored="false"/>
<dynamicField name="*_lon" type="tdouble" indexed="true" stored="false"/>
<dynamicField name="*_point" type="point"/>

home_lat
home_lon
work_point_lat
work_point_lon

Note: if you want the double or triple underscore to help prevent collisions... then you could use latSuffix="___lat" and define the dynamic fields that way.

Additionally, how do you deal w/ a point in a 3D (or n-D) space?

I just don't see why a user shouldn't be able to use the FieldType just like any other FieldType, dynamic or not. I think it is easy enough to detect name collisions and you still get all the flexibility of dynamic fields.

So, for example, say I was modeling a user and their employment history. Thus, I have a single home address plus multiple work addresses. One way of doing this would be:

<field name="home" type="point"/>
<dynamicField name="work_*" type="point"/>

And that should all just work. The user would only ever deal w/ home or work_*, but not have to deal w/ home___0 or whatever unless they really truly wanted to and even then I am not sure it is needed.

How would you do this with what is proposed above? Seems like you'd have a whole proliferation of fields.

Also, I don't see why a FieldType should have a dep. on a Field. Having a dependency on another FieldType seems reasonable, but I'm not sure about on a Field.
Re: SOLR-1131 - Multiple Fields per Field Type
On Wed, Dec 9, 2009 at 3:21 PM, Grant Ingersoll gsing...@apache.org wrote:
Additionally, how do you deal w/ a point in a 3D (or n-D) space?

I guess you would go back to the way you did it (0,1,etc). This was really just a naming variation, not really a different approach.

I just don't see why a user shouldn't be able to use the FieldType just like any other FieldType, dynamic or not. I think it is easy enough to detect name collisions and you still get all the flexibility of dynamic fields. So, for example, say I was modeling a user and their employment history. Thus, I have a single home address plus multiple work addresses. One way of doing this would be:
<field name="home" type="point"/>
<dynamicField name="work_*" type="point"/>
And that should all just work.

But it isn't that simple: you needed to define the point type, and that point type needed to reference/define another type. In the dynamicField proposal, you need to define a _latlon dynamic field once. It's also a separate decision from the lookup mechanism (dynamic field based, or add a new poly-field mechanism) - the point field type could choose to dynamically register *_latlon if it isn't already registered.

[...] How would you do this with what is proposed above? Seems like you'd have a whole proliferation of fields.

I thought I defined it well... hmmm. I'll take another stab, outlining using dynamic fields in both scenarios (explicitly defined dynamic fields, and automatically defined as part of the creation of the point class). I think we really do need to get concrete about our options at this point.

-Yonik
Re: SOLR-1131 - Multiple Fields per Field Type
On Dec 9, 2009, at 3:47 PM, Yonik Seeley wrote: I thought I defined it well... hmmm. I'll take another stab, outlining using dynamic fields in both scenarios (explicitly defined dynamic fields, and automatically defined as part of the creation of the point class). I think we really do need to get concrete about our options at this point. Agreed, code would be good.
Re: SOLR-1131 - Multiple Fields per Field Type
On Wed, Dec 9, 2009 at 3:49 PM, Grant Ingersoll gsing...@apache.org wrote: On Dec 9, 2009, at 3:47 PM, Yonik Seeley wrote: I thought I defined it well... hmmm. I'll take another stab, outlining using dynamic fields in both scenarios (explicitly defined dynamic fields, and automatically defined as part of the creation of the point class). I think we really do need to get concrete about our options at this point. Agreed, code would be good. I had code (untested) just using dynamic fields... you changed it :-P But I meant actual fieldType and field definitions, and what fields get indexed as a result, and how type lookups on those fields happens. -Yonik http://www.lucidimagination.com
Re: SOLR-1131 - Multiple Fields per Field Type
On Dec 9, 2009, at 3:52 PM, Yonik Seeley wrote: On Wed, Dec 9, 2009 at 3:49 PM, Grant Ingersoll gsing...@apache.org wrote: On Dec 9, 2009, at 3:47 PM, Yonik Seeley wrote: I thought I defined it well... hmmm. I'll take another stab, outlining using dynamic fields in both scenarios (explicitly defined dynamic fields, and automatically defined as part of the creation of the point class). I think we really do need to get concrete about our options at this point. Agreed, code would be good. I had code (untested) just using dynamic fields... you changed it :-P But I meant actual fieldType and field definitions, and what fields get indexed as a result, and how type lookups on those fields happens. Fair enough!
Re: SOLR-1131 - Multiple Fields per Field Type
Proposal for handling points using only the field lookup mechanisms currently in place in IndexSchema:

Option A: dynamic fields used for subfields, those dynamic fields need to be explicitly defined in the XML
// needed to essentially define the point type
<fieldType name="latlon" class="TrieDoubleField" precisionStep="8"/>
<fieldType name="point" subFieldSuffix="_latlon" .../>
<dynamicField name="*_latlon" type="latlon" indexed="true" stored="false"/>
// uses of the point type
<field name="home" type="point"/>
<dynamicField name="*_point" type="point"/>
// subFieldSuffix is appended to the subFields indexed and thus those would be
home__0_latlon
home__1_latlon
// And the indexed fields for dynamic field work_point would be
work_point__0_latlon
work_point__1_latlon
// NOTE: this scheme works fine for subFields with different fieldTypes

Option B: dynamic fields used for subfields, dynamic fields inserted into schema automatically
// needed to essentially define the point type
<fieldType name="latlon" class="TrieDoubleField" precisionStep="8"/>
<fieldType name="point" subFieldType="latlon"/>
// uses of the point type
<field name="home" type="point"/>
<dynamicField name="*_point" type="point"/>
// A dynamic field is inserted into the schema by the point class of the form __subFieldTypeName by default.
// This could be changed via an optional subFieldSuffix param on the point fieldType. double underscore used
// to minimize collisions with user-defined dynamic fields.
home_0__latlon
home_1__latlon
// And the indexed fields for dynamic field work_point would be
work_point__0__latlon
work_point__1__latlon
// NOTE: this scheme works fine for subFields with different fieldTypes

-Yonik
http://www.lucidimagination.com
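The naming rule behind both options can be sketched as a small function. This is a reading of the examples above, not the patch itself: the function name and the "__" index separator are assumptions, and under Option B the suffix is assumed to default to "__" plus the sub field type name. The sketch follows the work_point pattern; the message lists home_0__latlon for the plain home field under Option B, which this rule does not reproduce.

```python
def subfield_names(field_name, suffix, dimension=2):
    """Build the indexed subfield names for one poly field.

    Assumed rule: <field name> + "__" + <index> + <suffix>.
    """
    return [f"{field_name}__{i}{suffix}" for i in range(dimension)]


# Option A: the suffix is configured explicitly (subFieldSuffix="_latlon")
# and must match a dynamicField such as *_latlon declared by hand.
print(subfield_names("home", "_latlon"))
print(subfield_names("work_point", "_latlon"))

# Option B: the suffix is derived from the sub field type ("__" + "latlon")
# and the matching dynamic field is registered automatically.
print(subfield_names("work_point", "__" + "latlon"))
```

Either way the generated names end in a suffix that an ordinary dynamic field pattern can match, which is why neither option needs a new lookup mechanism.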
Re: SOLR-1131 - Multiple Fields per Field Type
OK, I'm fine w/ taking this type of approach, as opposed to the lookup mechanism I have. Of the two laid out below, there are pros and cons to both, as I see it. I'm inclined towards Option B. This keeps it hidden from the user, but doesn't require extra work for Solr. Let me code it up.

On Dec 9, 2009, at 4:12 PM, Yonik Seeley wrote:
Proposal for handling points using only the field lookup mechanisms currently in place in IndexSchema:

Option A: dynamic fields used for subfields, those dynamic fields need to be explicitly defined in the XML
// needed to essentially define the point type
<fieldType name="latlon" class="TrieDoubleField" precisionStep="8"/>
<fieldType name="point" subFieldSuffix="_latlon" .../>
<dynamicField name="*_latlon" type="latlon" indexed="true" stored="false"/>
// uses of the point type
<field name="home" type="point"/>
<dynamicField name="*_point" type="point"/>
// subFieldSuffix is appended to the subFields indexed and thus those would be
home__0_latlon
home__1_latlon
// And the indexed fields for dynamic field work_point would be
work_point__0_latlon
work_point__1_latlon
// NOTE: this scheme works fine for subFields with different fieldTypes

Option B: dynamic fields used for subfields, dynamic fields inserted into schema automatically
// needed to essentially define the point type
<fieldType name="latlon" class="TrieDoubleField" precisionStep="8"/>
<fieldType name="point" subFieldType="latlon"/>
// uses of the point type
<field name="home" type="point"/>
<dynamicField name="*_point" type="point"/>
// A dynamic field is inserted into the schema by the point class of the form __subFieldTypeName by default.
// This could be changed via an optional subFieldSuffix param on the point fieldType. double underscore used
// to minimize collisions with user-defined dynamic fields.
home_0__latlon
home_1__latlon
// And the indexed fields for dynamic field work_point would be
work_point__0__latlon
work_point__1__latlon
// NOTE: this scheme works fine for subFields with different fieldTypes
Re: SOLR-1131 - Multiple Fields per Field Type
Hi Yonik et al.,

I'd like to add:

Option C: Sub fields are specified as an attribute on the fieldType tag

  // needed to essentially define the point type
  <fieldType name="latlon" class="GeoPoint" subFieldSuffix="_latlon" ../>

  // uses of the latlon type
  <field name="home" type="latlon" indexed="true" stored="false"/>

  // subFieldSuffix is appended to the subFields indexed and thus those would be:
  home_latlon_0
  home_latlon_1

I like elements of Option B that you present below; however, it seems to be mixing concerns. Type inheritance (aka your subFieldType attribute) seems to be orthogonal to poly fields -- a good idea, but another issue IMHO.

Cheers,
Chris

On 12/9/09 1:12 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> Proposal for handling points using only the field lookup mechanisms currently in place in IndexSchema:
>
> Option A: dynamic fields used for subfields, those dynamic fields need to be explicitly defined in the XML
>
>   // needed to essentially define the point type
>   <fieldType name="latlon" class="TrieDoubleField" precisionStep="8"/>
>   <fieldType name="point" subFieldSuffix="_latlon" .../>
>   <dynamicField name="*_latlon" type="latlon" indexed="true" stored="false"/>
>
>   // uses of the point type
>   <field name="home" type="point"/>
>   <dynamicField name="*_point" type="point"/>
>
>   // subFieldSuffix is appended to the subFields indexed and thus those would be
>   home__0_latlon
>   home__1_latlon
>
>   // And the indexed fields for dynamic field work_point would be
>   work_point__0_latlon
>   work_point__1_latlon
>
>   // NOTE: this scheme works fine for subFields with different fieldTypes
>
> Option B: dynamic fields used for subfields, dynamic fields inserted into schema automatically
>
>   // needed to essentially define the point type
>   <fieldType name="latlon" class="TrieDoubleField" precisionStep="8"/>
>   <fieldType name="point" subFieldType="latlon"/>
>
>   // uses of the point type
>   <field name="home" type="point"/>
>   <dynamicField name="*_point" type="point"/>
>
>   // A dynamic field is inserted into the schema by the point class of the form __subFieldTypeName by default.
>   // This could be changed via an optional subFieldSuffix param on the point fieldType. Double underscore used
>   // to minimize collisions with user-defined dynamic fields.
>   home_0__latlon
>   home_1__latlon
>
>   // And the indexed fields for dynamic field work_point would be
>   work_point__0__latlon
>   work_point__1__latlon
>
>   // NOTE: this scheme works fine for subFields with different fieldTypes
>
> -Yonik
> http://www.lucidimagination.com

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
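To make the naming schemes in this message easier to compare, here is a minimal Java sketch of Option A and Option C as their examples read. It is only an illustration under stated assumptions: SubFieldNames, optionA and optionC are hypothetical names, not Solr code, and the formats are inferred from the sample field names (home__0_latlon and home_latlon_0). Option B is left out because its two examples (home_0__latlon versus work_point__0__latlon) do not follow a single format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr: spells out the sub-field naming
// conventions shown in the Option A and Option C examples.
final class SubFieldNames {

    // Option A, as its examples read: <field>__<index><suffix>,
    // e.g. home__0_latlon.
    static List<String> optionA(String field, String suffix, int dimension) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < dimension; i++) {
            names.add(field + "__" + i + suffix);
        }
        return names;
    }

    // Option C, as its examples read: <field><suffix>_<index>,
    // e.g. home_latlon_0.
    static List<String> optionC(String field, String suffix, int dimension) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < dimension; i++) {
            names.add(field + suffix + "_" + i);
        }
        return names;
    }
}
```

Calling optionA("work_point", "_latlon", 2) reproduces the dynamic-field case from Option A, which shows that the scheme composes with user-visible dynamic field names without extra configuration.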
RE: SOLR-1131 - Multiple Fields per Field Type
: <fieldType name="latlon" type="LatLonFieldType" pattern="location__*" />
: <fieldType name="latlon_home" type="LatLonFieldType" pattern="location_home_*"/>
: <fieldType name="latlon_work" type="LatLonFieldType" pattern="location_home_*"/>
:
: <field name="location" type="latlon"/>
: <field name="location_home" type="latlon_home"/>
: <field name="location_work" type="latlon_work"/>

I'm not really understanding the value of an approach like that. For starters, what Lucene field names would ultimately be created in those examples? And if I also added...

  <field name="other_location" type="latlon"/>
  <dynamicField name="*_dynamic_location" type="latlon"/>

...then what field names would be created under the covers?

: I think it makes more sense to define the heterogeneity at the fieldType level because:
:
: (a) it's a bit more consistent with the existing solr schema examples,
: where the difference between many of the field types (e.g., ints and
: tints, which are both solr.TrieIntField's, date and tdate, both
: instances of solr.TrieDateField, with different configuration, etc.)
:
: (b) isolation of change: fieldType defs will change less often than
: field defs, where names and indexed/stored/etc. debugging are likely
: to occur more frequently

...this just feels wrong to me ... I can't really explain why.

It seems like you are suggesting that every <field/> declaration would need a one-to-one correspondence with a unique <fieldType/> declaration in order to prevent field name collisions, which sounds sketchy enough ... but I'm also not fond of the idea that a person editing the schema can't just look at the <field/> and <dynamicField/> names to ensure that they understand what underlying fields are being created (so they don't inadvertently add a new one that collides) ... now they also have to look at the pattern attribute of every <fieldType/> that is a poly field.

Letting <dynamicField/> drive everything just seems a *lot* simpler ... both as far as implementation, and as far as maintaining the schema.

: I don't think the above hybrid approach will lead to anything other than
: confusion, as you indicated above. Let's stick to the pattern defs at
: the fieldType level, and then let the fieldType handle the internal
: dynamicity with e.g., a dynamicField, and then notify the schema user

From the standpoint of reading a schema.xml file, the approach you're describing of a pattern attribute on <fieldType/> declarations actually seems more confusing than the strawman suggestion I made of a pattern attribute on <field/> ... even without understanding what concrete fields you are suggesting would be created with a configuration like that, it still increases the number of places you have to look to see what field names are getting created.

-Hoss
Re: SOLR-1131 - Multiple Fields per Field Type
: I'm not sure if you worry about it. But I'd argue it isn't natural
: anyway. You would do the following instead, which is how any address
: book I've ever seen works:
: <field name="home" type="LatLonFT"/>
: <field name="work" type="LatLonFT"/>

...the home vs work distinction was arbitrary. The point is: what if I want to support an arbitrary number of distinct values in a PolyField? ... with your approach any attempt to search for people near X would require me to search for work near X or home near X ... which is analogous to one of the main purposes of multivalued fields: so I don't have to uniquely name every Field instance.

I might have a thousand unique (but unnamed) locations that I want to associate with a document, and I want to search for documents with a location near X ... likewise I might have thousands of unique polygons associated with a document and I want to search for documents where one or more polygons overlap with an input polygon (ie: island nations overlapping with the flight path of an airplane).

The question is: how can/would PolyFields deal with input like this? .. we've discussed cardinality in the number of fields produced by a single input value, but we haven't really discussed cardinality in the number of input values.

: So, maybe the FT can explicitly prohibit multivalued? But, I suppose
: you could do the position thing, too. This could be achieved through a
: new SpanQuery pretty easily: SpanPositionQuery that takes in a term and
: a specific position. Trivial to write, I think, just not sure if it is
: generally useful. Although, I must say I've been noodling around with

The problem is how do you let the PolyField specify the position when indexing? The last API I saw fleshed out in this discussion didn't give the PolyField any information about how many input values were in any given doc, it just allowed PolyFields to be String => Field[] black boxes (as opposed to the String => Field black boxes FieldTypes must currently be).

We can't assume even basic lastPosition+1 type logic for these polyfields, because different input values might produce Field arrays containing different quantities of fields, with different names. If a CartesianPolyFieldType can get away with only using the grid_level1 and grid_level2 fields for one input value, but other input values require using grid_level1, grid_level2, and grid_level3, then simple position increments aren't enough if a document has multiple values (some of which need 2 different Field names, and others that need 3).

-Hoss
Re: SOLR-1131 - Multiple Fields per Field Type
On Dec 7, 2009, at 5:59 PM, Chris Hostetter wrote:

> : <fieldType name="latlon" type="LatLonFieldType" pattern="location__*" />
> : <fieldType name="latlon_home" type="LatLonFieldType" pattern="location_home_*"/>
> : <fieldType name="latlon_work" type="LatLonFieldType" pattern="location_home_*"/>
> :
> : <field name="location" type="latlon"/>
> : <field name="location_home" type="latlon_home"/>
> : <field name="location_work" type="latlon_work"/>
>
> I'm not really understanding the value of an approach like that. For starters, what Lucene field names would ultimately be created in those examples? And if I also added...

Have a look at the patch I put up today. I think it is going to work quite well, but that could be jet-lag induced delirium at this point.

For a field type:

  <fieldType name="point" type="solr.PointType" dimension="2" subFieldType="double"/>

and a field declared as:

  <field name="home" type="point" indexed="true" stored="true"/>

And a new document of:

  <doc>
    <field name="point">39.0 -79.434</field>
  </doc>

There are three fields created:

  home     -- Contains the stored value
  home___0 -- Contains 39.0 indexed as a double (as in the "double" FieldType, not just a double precision)
  home___1 -- Contains -79.434 as a double

>   <field name="other_location" type="latlon"/>
>   <dynamicField name="*_dynamic_location" type="latlon"/>
>
> ...then what field names would be created under the covers?
>
> : I think it makes more sense to define the heterogeneity at the fieldType level because:
> :
> : (a) it's a bit more consistent with the existing solr schema examples,
> : where the difference between many of the field types (e.g., ints and
> : tints, which are both solr.TrieIntField's, date and tdate, both
> : instances of solr.TrieDateField, with different configuration, etc.)
> :
> : (b) isolation of change: fieldType defs will change less often than
> : field defs, where names and indexed/stored/etc. debugging are likely
> : to occur more frequently
>
> ...this just feels wrong to me ... I can't really explain why.
>
> It seems like you are suggesting that every <field/> declaration would need a one-to-one correspondence with a unique <fieldType/> declaration in order to prevent field name collisions, which sounds sketchy enough ... but I'm also not fond of the idea that a person editing the schema can't just look at the <field/> and <dynamicField/> names to ensure that they understand what underlying fields are being created (so they don't inadvertently add a new one that collides) ... now they also have to look at the pattern attribute of every <fieldType/> that is a poly field.
>
> Letting <dynamicField/> drive everything just seems a *lot* simpler ... both as far as implementation, and as far as maintaining the schema.

I don't agree. It requires more configuration and more knowledge by the end user and doesn't hide the details.
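The three-field outcome described in this message (home, home___0, home___1) can be sketched in a few lines of Java. This is a rough illustration only, not the code from the patch: PointSplitSketch and subFields are hypothetical names, the map stands in for the Lucene Field instances a real FieldType would return, and splitting on commas or whitespace is an assumption about the input format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a poly field type; not the actual SOLR-1131 patch.
final class PointSplitSketch {

    // Returns the field names and values that would be created for one input
    // value: the original field holding the stored value, plus one sub-field
    // per dimension named <field>___<index>.
    static Map<String, String> subFields(String fieldName, String value, int dimension) {
        String[] parts = value.trim().split("[,\\s]+");
        if (parts.length != dimension) {
            throw new IllegalArgumentException(
                "expected " + dimension + " coordinates but got " + parts.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(fieldName, value);            // stored value, kept as entered
        for (int i = 0; i < dimension; i++) {
            Double.parseDouble(parts[i]);        // reject non-numeric input early
            fields.put(fieldName + "___" + i, parts[i]);
        }
        return fields;
    }
}
```

For the input "39.0 -79.434" on a field named home with dimension 2, the sketch yields the same three names listed above. Note that nothing in the method signature tells the type whether this is one of several values for the same document, which is the multivalued concern raised elsewhere in the thread.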
Re: SOLR-1131 - Multiple Fields per Field Type
On Dec 7, 2009, at 6:13 PM, Chris Hostetter wrote:

> : I'm not sure if you worry about it. But I'd argue it isn't natural
> : anyway. You would do the following instead, which is how any address
> : book I've ever seen works:
> : <field name="home" type="LatLonFT"/>
> : <field name="work" type="LatLonFT"/>
>
> ...the home vs work distinction was arbitrary. The point is: what if I want to support an arbitrary number of distinct values in a PolyField?

This is the beauty of Yonik's addition of getFieldQuery() to the FieldType. The FieldType will be aware of the arbitrariness. Furthermore, it can reflect on the index itself via IndexReader.getFieldNames() to determine the number of Fields that actually exist if it has to. However, my guess is that in practice in most situations the FieldType author/user will have the info it needs. Still, I think we can also evolve if we need to.

> ... with your approach any attempt to search for people near X would require me to search for work near X or home near X ... which is analogous to one of the main purposes of multivalued fields: so I don't have to uniquely name every Field instance.

Sure, but would you really ever model multiple locations like that in the same field? I don't think in practice that you would, so I think it is a bit of a red herring. Perhaps there is a different use case that better demonstrates it?

> I might have a thousand unique (but unnamed) locations that I want to associate with a document, and I want to search for documents with a location near X ... likewise I might have thousands of unique polygons associated with a document and I want to search for documents where one or more polygons overlap with an input polygon (ie: island nations overlapping with the flight path of an airplane).

I don't think this implementation precludes that. The FunctionQueries only operating on a single valued field does, however. Setting that aside, we could write a Query that does what you want, I think.

> The question is: how can/would PolyFields deal with input like this? .. we've discussed cardinality in the number of fields produced by a single input value, but we haven't really discussed cardinality in the number of input values.

I'm not sure that it does, but I don't know that it needs to just yet. This might be where an R-Tree implementation comes in handy, but I'll leave it to the geo-experts to discuss more. I also am not sure how the PolyField case is any different than the dynamic field case. Either way, something needs to know the names of the fields that were created.

> : So, maybe the FT can explicitly prohibit multivalued? But, I suppose
> : you could do the position thing, too. This could be achieved through a
> : new SpanQuery pretty easily: SpanPositionQuery that takes in a term and
> : a specific position. Trivial to write, I think, just not sure if it is
> : generally useful. Although, I must say I've been noodling around with
>
> The problem is how do you let the PolyField specify the position when indexing? The last API I saw fleshed out in this discussion didn't give the PolyField any information about how many input values were in any given doc, it just allowed PolyFields to be String => Field[] black boxes (as opposed to the String => Field black boxes FieldTypes must currently be).
>
> We can't assume even basic lastPosition+1 type logic for these polyfields, because different input values might produce Field arrays containing different quantities of fields, with different names. If a CartesianPolyFieldType can get away with only using the grid_level1 and grid_level2 fields for one input value, but other input values require using grid_level1, grid_level2, and grid_level3, then simple position increments aren't enough if a document has multiple values (some of which need 2 different Field names, and others that need 3).

That's not how the Cartesian Field stuff works, but I think I see what you are getting at and I would say I'm going to explicitly punt on that right now. Ultimately, I think when such a case comes up, the FieldType needs to be configured to be able to determine this information.

-Grant
Re: SOLR-1131 - Multiple Fields per Field Type
Hi Hoss,

> : <fieldType name="latlon" type="LatLonFieldType" pattern="location__*" />
> : <fieldType name="latlon_home" type="LatLonFieldType" pattern="location_home_*"/>
> : <fieldType name="latlon_work" type="LatLonFieldType" pattern="location_home_*"/>
> :
> : <field name="location" type="latlon"/>
> : <field name="location_home" type="latlon_home"/>
> : <field name="location_work" type="latlon_work"/>
>
> I'm not really understanding the value of an approach like that. For starters, what Lucene field names would ultimately be created in those examples?

The first field would be named location__location. The second field would be named location_home_location_home. The third field would be named location_work_location_work.

> And if I also added...
>
>   <field name="other_location" type="latlon"/>
>   <dynamicField name="*_dynamic_location" type="latlon"/>
>
> ...then what field names would be created under the covers?

In general, it would be FieldType#getPattern().stripOffEndRegexStarStuff() + Field#getName().

> : I think it makes more sense to define the heterogeneity at the fieldType level because:
> :
> : (a) it's a bit more consistent with the existing solr schema examples,
> : where the difference between many of the field types (e.g., ints and
> : tints, which are both solr.TrieIntField's, date and tdate, both
> : instances of solr.TrieDateField, with different configuration, etc.)
> :
> : (b) isolation of change: fieldType defs will change less often than
> : field defs, where names and indexed/stored/etc. debugging are likely
> : to occur more frequently
>
> ...this just feels wrong to me ... I can't really explain why.
>
> It seems like you are suggesting that every <field/> declaration would need a one-to-one correspondence with a unique <fieldType/> declaration in order to prevent field name collisions, which sounds sketchy enough ... but I'm also not fond of the idea that a person editing the schema can't just look at the <field/> and <dynamicField/> names to ensure that they understand what underlying fields are being created (so they don't inadvertently add a new one that collides) ... now they also have to look at the pattern attribute of every <fieldType/> that is a poly field.

Well, if this feels wrong to you then I think the schema.xml file that ships with SOLR should feel wrong as well, because it uses the exact same pattern for defining field type variations. That is, differences between FieldType representations for ints and tints are not stored as variations on the SchemaField definition itself; they are stored as variations on the FieldTypes (e.g., a different precisionStep in the case of int [0] versus that of tint [8]). Based on what you are proposing, why isn't precisionStep an attribute on field, rather than fieldType, in those examples?

> Letting <dynamicField/> drive everything just seems a *lot* simpler ... both as far as implementation, and as far as maintaining the schema.

Possibly. It's also a lot less traceable. It's implicit versus explicit, which I'm not sure leads to simplicity in the end.

> : I don't think the above hybrid approach will lead to anything other than
> : confusion, as you indicated above. Let's stick to the pattern defs at
> : the fieldType level, and then let the fieldType handle the internal
> : dynamicity with e.g., a dynamicField, and then notify the schema user
>
> From the standpoint of reading a schema.xml file, the approach you're describing of a pattern attribute on <fieldType/> declarations actually seems more confusing than the strawman suggestion I made of a pattern attribute on <field/> ... even without understanding what concrete fields you are suggesting would be created with a configuration like that, it still increases the number of places you have to look to see what field names are getting created.

How so? In actuality, it reduces it. Instead of having pattern definitions on fields (which there is a greater chance of having more of), you have them on field types.

Cheers,
Chris
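The naming rule given in this reply, FieldType#getPattern().stripOffEndRegexStarStuff() + Field#getName(), is pseudocode; neither method is an existing Solr API. A minimal Java sketch of one way to read it, assuming "strip off the end regex star stuff" simply means dropping trailing '*' characters from the pattern (PatternNaming and internalName are hypothetical names):

```java
// Hypothetical illustration of the proposed naming rule; not Solr code.
final class PatternNaming {

    // Drops any trailing '*' from the fieldType pattern and prepends what is
    // left to the declared field name.
    static String internalName(String pattern, String fieldName) {
        String prefix = pattern.replaceAll("\\*+$", "");
        return prefix + fieldName;
    }
}
```

Under that reading, internalName("location__*", "location") gives location__location and internalName("location_home_*", "location_home") gives location_home_location_home, matching the first two names in the reply. The third name, location_work_location_work, only falls out if the latlon_work pattern is location_work_*; with the pattern as literally written in the schema snippet (location_home_*), the same rule produces location_home_location_work.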
RE: SOLR-1131 - Multiple Fields per Field Type
Hi Grant,

On 12/02/2009 at 2:30 PM, Grant Ingersoll wrote:

> I've been noodling around with the notion of a layered field where variants of a primary token are stored at sub positions of the primary token (instead of in separate copy fields)

The Indri search engine (now part of Lemur) uses a similar idea: fields are implemented as potentially overlapping extents over the (single) stream of document tokens. (Howard Turtle, who is now the CNLP director, and has been involved in Indri development, told me about this feature. He says it allows for natural representation of fields projected onto hierarchical data, e.g. XML.)

I wasn't able to find much documentation about this online when I looked just now, but here's a high-level overview of the Indri repository (aka index) structure:

http://www.lemurproject.org/docs/index.php/Indri_Repository_Structure

(See the Field Information Files section near the bottom.)

Steve
Re: SOLR-1131 - Multiple Fields per Field Type
And this is also an approach Yonik drafted here for user/tagging design:

http://wiki.apache.org/solr/UserTagDesign

Erik

On Dec 4, 2009, at 1:35 PM, Steven A Rowe wrote:

> Hi Grant,
>
> On 12/02/2009 at 2:30 PM, Grant Ingersoll wrote:
>
> > I've been noodling around with the notion of a layered field where variants of a primary token are stored at sub positions of the primary token (instead of in separate copy fields)
>
> The Indri search engine (now part of Lemur) uses a similar idea: fields are implemented as potentially overlapping extents over the (single) stream of document tokens. (Howard Turtle, who is now the CNLP director, and has been involved in Indri development, told me about this feature. He says it allows for natural representation of fields projected onto hierarchical data, e.g. XML.)
>
> I wasn't able to find much documentation about this online when I looked just now, but here's a high-level overview of the Indri repository (aka index) structure:
>
> http://www.lemurproject.org/docs/index.php/Indri_Repository_Structure
>
> (See the Field Information Files section near the bottom.)
>
> Steve
Re: SOLR-1131 - Multiple Fields per Field Type
On Dec 1, 2009, at 1:42 AM, Chris Hostetter wrote:

> It feels like something we've overlooked in this discussion is whether we need to worry about any FieldType API changes needed to make these new PolyField classes aware of when they are multivalued.
>
> The API suggestions Grant made give the FieldType the ability to return a Field[] from a single field value input -- but they don't provide any information about whether that field value is one of many values we're indexing for this field name.
>
> Imagine that I want to make an index of people I know. Each person also has multiple locations where they can frequently be found (home, work, gym, girlfriend's house, favorite coffee shop, etc.). My common case is to search for people, not locations, so it doesn't make sense to flatten out and have a doc for each person+location, I just want a single doc per person, but that means I need a locations field that's multivalued.
>
> If I'm using a simple LatLonFieldType that splits my comma separated coordinate string into a locations__LAT and a locations__LON field then I assume it needs to do something special in the multiValued case to make sure later near searches don't get confused and think that the lat from my work and the lon from my home are actually a third location.
>
> How do we solve this?

I'm not sure if you worry about it. But I'd argue it isn't natural anyway. You would do the following instead, which is how any address book I've ever seen works:

  <field name="home" type="LatLonFT"/>
  <field name="work" type="LatLonFT"/>

So, maybe the FT can explicitly prohibit multivalued? But, I suppose you could do the position thing, too. This could be achieved through a new SpanQuery pretty easily: SpanPositionQuery that takes in a term and a specific position. Trivial to write, I think, just not sure if it is generally useful.

Although, I must say I've been noodling around with the notion of a layered field where variants of a primary token are stored at sub positions of the primary token (instead of in separate copy fields) and then one could write a query that says, for instance, search all of the secondary terms. So, for instance, if you think of each position containing a stack of terms, then you could say use the terms at position two in the stack. I'm not quite sure what this means just yet, but my thinking is that I could get a really compact index at the cost of a slightly more complex query. It also means I could do some interesting things at query time that simply cannot be done across fields at the moment, for instance, create a phrase type query that used different layers where appropriate.

-Grant
Re: SOLR-1131 - Multiple Fields per Field Type
: Maybe, but something needs that logic. Think relational database -- if you
: try and add a field to a schema (e.g., using some DBMS client GUI or vanilla
: command line SQL) where that name already exists, then you get a SQL
: exception. Similarly, SOLR should support such concepts. Maybe it doesn't go
	...
: How are you screwed? If you add a field name that collides then as long as
: the FieldType checked you'd still be OK? Maybe FieldTypes that support
: multi-internal fields should have a requirement that they be configured from
: the schema.xml file themselves, so that the user configuring the entire
: schema can be made to deal with the namespacing at that level -- then if she

...but these are just two different solutions that illustrate my overall point: one way or another, the person editing the schema.xml file needs to know that these FieldTypes are going to be adding additional fields to the index and has to be aware of the possibility of field name collisions -- either because they are required to configure what those additional fieldnames are for these FieldTypes to prevent collisions, or because they have to know how to make sense of errors that might get logged if the system detects a collision.

So rather than try to make it entirely magical and behind the scenes, and still require them to know about it if a collision happens and they get an error, let's put it right out in front of them so they know about it and think it through.

If people feel that something like this...

  <fieldType name="latlon" type="LatLonFieldType" />
  <dynamicField name="location*" type="latlon" />

...where an end user can deal with these fields...

  location
  location_home
  location_work

...and under the covers the field type uses...

  location__LAT + location__LON
  location_home__LAT + location_home__LON
  location_work__LAT + location_work__LON

...is an abuse of the <dynamicField/> syntax, then we could accomplish the same thing with something like...

  <fieldType name="latlon" type="LatLonFieldType" />
  <field name="location" type="latlon" pattern="location__*" />
  <field name="location_home" type="latlon" pattern="location_home__*" />
  <field name="location_work" type="latlon" pattern="location_work__*" />

...but that would be more verbose, and would be somewhat confusing to try and use as a true dynamicField (ie: we want to support home, work and anything else picked at run time)...

  <fieldType name="latlon" type="LatLonFieldType" />
  <field name="location" type="latlon" pattern="location*" />
  <dynamicfield name="location_*" type="latlon" pattern="??whagoeshere??" />

So why not just leverage the existing <dynamicField/> syntax/mechanism where schema creators already expect fields to be created at runtime, and already have to think about possible name collisions?

-Hoss
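The collision concern in this message can be made concrete with a short Java sketch. It assumes the __LAT / __LON suffixes from the example above and treats the schema as plain lists of names; CollisionCheck and collisions are hypothetical, and nothing here reflects how IndexSchema actually validates field names.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical check, not Solr code: finds generated sub-field names that
// clash with names the schema author declared explicitly.
final class CollisionCheck {

    static Set<String> collisions(Collection<String> declaredNames,
                                  Collection<String> latLonFields) {
        // Names the lat/lon type would create under the covers.
        Set<String> generated = new TreeSet<>();
        for (String field : latLonFields) {
            generated.add(field + "__LAT");
            generated.add(field + "__LON");
        }
        // Keep only the generated names that are also declared by the user.
        generated.retainAll(new HashSet<>(declaredNames));
        return generated;
    }
}
```

A schema that declares location, location_home and, by accident, location__LAT would report location__LAT as a collision. Whichever configuration style is chosen, some check of this kind (or an error message produced by it) is what the schema author ends up having to understand.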
Re: SOLR-1131 - Multiple Fields per Field Type
It feels like something we've overlooked in this discussion is whether we need to worry about any FieldType API changes needed to make these new PolyField classes aware of when they are multivalued.

The API suggestions Grant made give the FieldType the ability to return a Field[] from a single field value input -- but they don't provide any information about whether that field value is one of many values we're indexing for this field name.

Imagine that I want to make an index of people I know. Each person also has multiple locations where they can frequently be found (home, work, gym, girlfriend's house, favorite coffee shop, etc.). My common case is to search for people, not locations, so it doesn't make sense to flatten out and have a doc for each person+location, I just want a single doc per person, but that means I need a locations field that's multivalued.

If I'm using a simple LatLonFieldType that splits my comma separated coordinate string into a locations__LAT and a locations__LON field then I assume it needs to do something special in the multiValued case to make sure later near searches don't get confused and think that the lat from my work and the lon from my home are actually a third location.

How do we solve this?

I suppose we could just rely on matching termPosition information, but that means the FieldType needs a way to specify the Analyzer for all of the field names it creates on the fly (another argument for reusing dynamicFields I guess) to specify matching increments -- but that seems somewhat brittle: what about complex PolyFieldTypes that want to create a variable number of Fields based on the input?

ie: as I recall, if you want to index coordinates of polygon bounding boxes using cartesian grid fields, you need more field names for big polygons than you do for small polygons -- so what if someone wants a multivalued PolyField and indexes very big and very small polygons? ... termPositions doesn't seem like it really cuts it here.

-Hoss
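A small Java sketch may help show the false match described above. It models a document's locations as (lat, lon) pairs and compares two ways of matching: one that checks latitude and longitude independently, which is roughly what two unrelated multivalued sub-fields would allow, and one that keeps each pair together. The class and method names are hypothetical, exact equality stands in for a real "near" query, and no Lucene API is involved.

```java
// Hypothetical illustration of the multivalued lat/lon problem; not Solr code.
final class MultiValuedPointSketch {

    // Independent sub-fields: the document matches if ANY value has the
    // latitude and ANY value (possibly a different one) has the longitude.
    static boolean matchesFlattened(double[][] points, double lat, double lon) {
        boolean latHit = false;
        boolean lonHit = false;
        for (double[] p : points) {
            if (p[0] == lat) latHit = true;
            if (p[1] == lon) lonHit = true;
        }
        return latHit && lonHit;
    }

    // Paired values: the document matches only if a single value has both
    // the latitude and the longitude.
    static boolean matchesPaired(double[][] points, double lat, double lon) {
        for (double[] p : points) {
            if (p[0] == lat && p[1] == lon) return true;
        }
        return false;
    }
}
```

With a person whose work is at (34.0, -118.0) and whose home is at (40.0, -74.0), a query for (34.0, -74.0) matches under the flattened model even though no such location exists, and does not match under the paired model. Term positions are one way to recover the pairing, which is why the variable number of sub-fields per value matters.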
RE: SOLR-1131 - Multiple Fields per Field Type
Hey Hoss,

> So rather than try to make it entirely magical and behind the scenes, and still require them to know about it if a collision happens and they get an error, let's put it right out in front of them so they know about it and think it through.

+1 to that -- was never trying to make anything magical, just to point out that there were a number of different solutions here, not all of which are orthogonal (as you pointed out above, SOLR may use a combination of intuitive log messages + explicit collision handling in code, not just one or the other).

> If people feel that something like this...
>
>   <fieldType name="latlon" type="LatLonFieldType" />
>   <dynamicField name="location*" type="latlon" />
>
> ...where an end user can deal with these fields...
>
>   location
>   location_home
>   location_work
>
> ...and under the covers the field type uses...
>
>   location__LAT + location__LON
>   location_home__LAT + location_home__LON
>   location_work__LAT + location_work__LON
>
> ...is an abuse of the <dynamicField/> syntax, then we could accomplish the same thing with something like...
>
>   <fieldType name="latlon" type="LatLonFieldType" />
>   <field name="location" type="latlon" pattern="location__*" />
>   <field name="location_home" type="latlon" pattern="location_home__*" />
>   <field name="location_work" type="latlon" pattern="location_work__*" />

Now you're talking. I like this option, with the following updates:

  <fieldType name="latlon" type="LatLonFieldType" pattern="location__*" />
  <fieldType name="latlon_home" type="LatLonFieldType" pattern="location_home_*"/>
  <fieldType name="latlon_work" type="LatLonFieldType" pattern="location_home_*"/>

  <field name="location" type="latlon"/>
  <field name="location_home" type="latlon_home"/>
  <field name="location_work" type="latlon_work"/>

I think it makes more sense to define the heterogeneity at the fieldType level because:

(a) it's a bit more consistent with the existing solr schema examples, where the difference between many of the field types (e.g., ints and tints, which are both solr.TrieIntField's, date and tdate, both instances of solr.TrieDateField, with different configuration, etc.)

(b) isolation of change: fieldType defs will change less often than field defs, where names and indexed/stored/etc. debugging are likely to occur more frequently

> ...but that would be more verbose, and would be somewhat confusing to try and use as a true dynamicField (ie: we want to support home, work and anything else picked at run time)...
>
>   <fieldType name="latlon" type="LatLonFieldType" />
>   <field name="location" type="latlon" pattern="location*" />
>   <dynamicfield name="location_*" type="latlon" pattern="??whagoeshere??" />

I don't think the above hybrid approach will lead to anything other than confusion, as you indicated above. Let's stick to the pattern defs at the fieldType level, and then let the fieldType handle the internal dynamicity with e.g., a dynamicField, and then notify the schema user by providing:

(1) a nice intuitive set of documentation with the poly field types that says: don't use these reserved field names in your schema if you are using this field type in any of your field instances (the concept is the same as in P/L's -- you can't declare variables named for or int, etc.); and

(2) intuitive error msgs and exceptions if the schema user insists on ignoring the poly field documentation.

> So why not just leverage the existing <dynamicField/> syntax/mechanism where schema creators already expect fields to be created at runtime, and already have to think about possible name collisions?

I think we should leverage dynamicFields, but maybe not explicitly. Otherwise you have to maintain the poly field def as both a dynamicField and a fieldType, which IMHO is not as elegant as multiple field type defs (configured instances of the same field type) with the pattern param you suggested, coupled with field declarations that use those fieldType configured instances.

Cheers,
Chris
RE: SOLR-1131 - Multiple Fields per Field Type
Hey Hoss,

From: Chris Hostetter [hossman_luc...@fucit.org]
Sent: Monday, November 30, 2009 5:42 PM
To: solr-dev@lucene.apache.org
Subject: Re: SOLR-1131 - Multiple Fields per Field Type

It feels like something we've overlooked in this discussion is whether we need to worry about any FieldType API changes needed to make these new PolyField classes aware of when they are multivalued.

The API suggestions Grant made give the FieldType the ability to return a Field[] from a single field value input -- but they don't provide any information about whether that field value is one of many values we're indexing for this field name.

Imagine that i want to make an index of people i know. Each person also has multiple locations where they can frequently be found (home, work, gym, girlfriend's house, favorite coffee shop, etc.). My common case is to search for people, not locations, so it doesn't make sense to flatten out and have a doc for each person+location. i just want a single doc per person, but that means i need a locations field that's multivalued. If i'm using a simple LatLonFieldType that splits my comma separated coordinate string into a locations__LAT and a locations__LON field, then i assume it needs to do something special in the multiValued case to make sure later "near" searches don't get confused and think that the lat from my work and the lon from my home are actually a third location.

how do we solve this?

I suppose we could just rely on matching termPosition information, but that means the FieldType needs a way to specify the Analyzer for all of the field names it creates on the fly (another argument for reusing dynamicFields i guess) * or, alternatively, fieldTypes with configured pattern params * to specify matching increments -- but that seems somewhat brittle: what about complex PolyFieldTypes that want to create a variable number of Fields based on the input?

* This would seem to argue for smart FieldTypes that understand how their information is persisted (not just pattern parameters), but that is perhaps something that's difficult to codify in XML versus an actual P/L. Increments might be the only variant, but there may be more *

ie: as i recall, if you want to index coordinates of polygon bounding boxes using cartesian grid fields, you need more field names for big polygons than you do for small polygons -- so what if someone wants a multivalued PolyField and indexes very big and very small polygons? ... termPositions doesn't seem like it really cuts it here.

* good food for thought -- I'll sleep on it tonight and see what I can think of to add to the discussion... *

Cheers,
Chris
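The cross-matching problem described above can be shown with a small sketch. This is not Solr code: `MultiValuedPointDemo` and its methods are hypothetical names used only to model two parallel sub-fields (standing in for locations__LAT and locations__LON) and to contrast an independent per-field test with a test that keeps each lat paired with its own lon.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical, not a Solr class) of a multivalued lat/lon field split into two sub-fields. */
class MultiValuedPointDemo {
    // Values one document would carry in the two generated sub-fields.
    final List<Double> lats = new ArrayList<>();
    final List<Double> lons = new ArrayList<>();

    void addLocation(double lat, double lon) {
        lats.add(lat);
        lons.add(lon);
    }

    /** Naive match: each sub-field is tested on its own, as two independent range queries would do. */
    boolean naiveBoxMatch(double minLat, double maxLat, double minLon, double maxLon) {
        boolean latHit = false;
        for (double lat : lats) {
            if (lat >= minLat && lat <= maxLat) latHit = true;
        }
        boolean lonHit = false;
        for (double lon : lons) {
            if (lon >= minLon && lon <= maxLon) lonHit = true;
        }
        return latHit && lonHit;
    }

    /** Paired match: lat and lon must come from the same value position. */
    boolean pairedBoxMatch(double minLat, double maxLat, double minLon, double maxLon) {
        for (int i = 0; i < lats.size(); i++) {
            double lat = lats.get(i);
            double lon = lons.get(i);
            if (lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        MultiValuedPointDemo person = new MultiValuedPointDemo();
        person.addLocation(40.0, -75.0);   // "home"
        person.addLocation(34.0, -118.0);  // "work"
        // Box around (34, -75): the lat of "work" combined with the lon of "home".
        System.out.println(person.naiveBoxMatch(33, 35, -76, -74));   // true: a phantom third location
        System.out.println(person.pairedBoxMatch(33, 35, -76, -74));  // false
    }
}
```

The paired variant only works here because the toy model keeps values in positional order; whether term positions or some other mechanism can provide that pairing in the index is exactly the open question in the message.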
Re: SOLR-1131 - Multiple Fields per Field Type
On Nov 28, 2009, at 7:37 PM, Chris Hostetter wrote:

> : I don't think it's useful to somehow programmatically access the list
> : of fields that a fieldType could output.
>
> based on my understanding of the potential types of use cases we're talking about, i think i agree with you. It seems like the most crucial aspect is that a FieldType has a way of producing multiple o.a.l.document.Field instances (potentially with different field names) from a single String input at index time. this can be done with something like the API that Grant mentioned earlier.
>
> For anything except trivial use cases, any code (like a query parser) attempting to deal with these fields is going to need to be very special purpose and have direct knowledge of the code in the FieldType. if a CartesianGeoSearchQParser is asked to parse store_location:89.3,45.4~5miles it can throw a parse exception if IndexSchema.getFieldType("store_location") isn't an instanceof CartesianGeoSearchFieldType -- assuming it is: it can cast it and call CartesianGeoSearchFieldType specific methods to find out everything it needs to know about what multitudes of field names that specific instance produced based on its configuration.
>
> (side thought: we may want to add a getFieldTypesByClass method to the IndexSchema so QParsers and SearchComponents can get lists of fields matching special cases they want to know about -- but that's a secondary concern)
>
> One thing that concerns me is potential field name collision -- where one of these new multifield producing FieldTypes might want to create a name that happens to collide with a field the user has already declared. Using double underscores kind of feels like a hack; what i keep wondering is if we can't leverage dynamicFields here.
>
> if we require that these FieldTypes be declared using dynamicField declarations (they could error on init otherwise) then the wildcard nature of the name tells the FieldType where it's allowed to add things to the pattern to make unique field names in the index -- and they can still be used as true dynamic fields, as long as they always add to the field name given to them. something like...
>
> <dynamicField name="location*" type="geo1" />

I thought about this too. It is what Local Solr currently does (although it expects a certain prefix, too, I believe). However, it seems a bit unnecessary, as now the user needs to use both the field type and the dynamic field in order to get it to work, whereas I don't think they should have to do that, as it isn't in line with the notion of a field type. FieldTypes currently can be used for any fields, both regular and dynamic.
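The instanceof-and-cast approach described in the quoted text can be sketched as follows. Everything here is a stand-in: `ToyFieldType`, `ToyGeoFieldType`, `ToySchema`, and `ToyGeoQueryParser` are hypothetical classes, not Solr's IndexSchema or any real QParser, and the `_lat`/`_lon` suffixes are assumed for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical base type; stands in for a generic field type. */
class ToyFieldType {}

/** Hypothetical geo type that knows which internal field names it produces. */
class ToyGeoFieldType extends ToyFieldType {
    List<String> subFieldNames(String fieldName) {
        return Arrays.asList(fieldName + "_lat", fieldName + "_lon");
    }
}

/** Hypothetical schema: maps field names to their types. */
class ToySchema {
    private final Map<String, ToyFieldType> types = new HashMap<>();

    void addField(String name, ToyFieldType type) { types.put(name, type); }

    ToyFieldType getFieldType(String name) { return types.get(name); }
}

/** Special-purpose parser with direct knowledge of the geo type. */
class ToyGeoQueryParser {
    /** Returns the internal fields a query such as "store_location:89.3,45.4~5miles" would touch. */
    static List<String> fieldsFor(ToySchema schema, String query) {
        int colon = query.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected field:value, got " + query);
        }
        String fieldName = query.substring(0, colon);
        ToyFieldType type = schema.getFieldType(fieldName);
        // Reject anything that is not the type this parser understands.
        if (!(type instanceof ToyGeoFieldType)) {
            throw new IllegalArgumentException(fieldName + " is not a geo field");
        }
        // Safe to cast: ask the type itself which field names it produced.
        return ((ToyGeoFieldType) type).subFieldNames(fieldName);
    }
}
```

The design point is that no generic "list the sub-fields" API is needed on the base type; only the parser that already depends on the specific type calls its specific methods.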
Re: SOLR-1131 - Multiple Fields per Field Type
: One thing that concerns me is potential field name collision -- where one
: of these new multifield producing FieldTypes might want to create a name
: that happens to collide with a field the user has already declared.
:
: Since FieldTypes are provided an instance of o.a.solr.schema.IndexSchema,
: couldn't these special MultiFieldTypes just call #getFieldOrNull(fieldName)
: to find out if an internal field they want to name is available in the
: schema namespace?

that would tell them if a field name is currently in use, but not what to do about it if it is already in use -- FieldType classes shouldn't need complicated heuristics to figure out something the user could configure.

Even if the FieldTypes were crazy smart -- it wouldn't provide any future-proofing. a FieldType could inspect the schema and see that certain field names aren't in use right now, but then i could change my schema and add a field name that *does* collide with something it previously picked, and now i'm screwed.

-Hoss
Re: SOLR-1131 - Multiple Fields per Field Type
: I thought about this too. It is what Local Solr currently does
: (although it expects a certain prefix, too, I believe). However, it
: seems a bit unnecessary, as now the user needs to use both the field
: type and the dynamic field in order to get it to work, whereas I don't
: think they should have to do that, as it isn't in line with the notion
: of a field type. FieldTypes currently can be used for any fields, both
: regular and dynamic.

we already have FieldTypes that only make sense when used as either a <field/> or as a <dynamicField/> ... RandomField only makes sense when used as a dynamicField, ExternalValueField doesn't make sense if you try to use it as a dynamicField -- it's just the nature of specialized FieldTypes.

It's one thing to say that we don't want search/index users to have to know about the details of how these fields work -- i agree with that, they should just be able to index and query against a "location" field and have it work, without knowing that "location" actually builds up a bunch of cartesian grid fields using names like location_0DAB9 ... but i think it's perfectly acceptable to ask that the schema creator / solr administrator have some understanding of these special field types, and to tell them "you need to declare these as <dynamicField/> because they add other low level fields using that prefix/suffix that you don't need to worry about."

The admin type users are going to need to know about these automagically created fields one way or another -- if not to prevent collision, then to make sure they don't get confused when they look at Luke and the schema browser.

-Hoss
Re: SOLR-1131 - Multiple Fields per Field Type
Hi Hoss,

On 11/29/09 12:22 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

> that would tell them if a field name is currently in use, but not what to do about it if it is already in use -- FieldType classes shouldn't need complicated heuristics to figure out something the user could configure.

Maybe, but something needs that logic. Think relational database -- if you try to add a field to a schema (e.g., using some DBMS client GUI or vanilla command line SQL) where that name already exists, then you get a SQL exception. Similarly, SOLR should support such concepts. Maybe it doesn't go in FieldType (though I'm not convinced of that since I don't think it's as complicated as is implied), but at the very least it should go into IndexSchema.

> Even if the FieldTypes were crazy smart -- it wouldn't provide any future-proofing. a FieldType could inspect the schema and see that certain field names aren't in use right now, but then i could change my schema and add a field name that *does* collide with something it previously picked, and now i'm screwed.

How are you screwed? If you add a field name that collides, then as long as the FieldType checked you'd still be OK? Maybe FieldTypes that support multi-internal fields should have a requirement that they be configured from the schema.xml file themselves, so that the user configuring the entire schema can be made to deal with the namespacing at that level -- then if she messes up the multi-field configuration, it's on her (with the help of SOLR reporting a helpful exception and the area in the schema where the collision occurred)...

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
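The "report a helpful exception at schema load" idea above can be sketched with a toy registry. `CollisionCheckingSchema`, `declare`, and `declarePolyField` are hypothetical names, not part of Solr's IndexSchema; the sketch assumes every name, user-declared or generated, is registered through one checkpoint each time the schema is loaded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical registry that rejects duplicate field names at schema load time. */
class CollisionCheckingSchema {
    // field name -> who declared it ("user" or the poly field that generated it)
    private final Map<String, String> owners = new LinkedHashMap<>();

    /** Registers a name, or fails with a message naming both parties to the collision. */
    void declare(String fieldName, String owner) {
        String existing = owners.get(fieldName);
        if (existing != null) {
            throw new IllegalStateException("Field '" + fieldName + "' requested by " + owner
                    + " is already declared by " + existing);
        }
        owners.put(fieldName, owner);
    }

    /** Registers a poly field plus the internal names it generates from the given suffixes. */
    void declarePolyField(String fieldName, String... suffixes) {
        declare(fieldName, "user");
        for (String suffix : suffixes) {
            declare(fieldName + suffix, "poly field " + fieldName);
        }
    }
}
```

Because the check runs on every load, a field added to the schema later that collides with a generated name fails at that load, regardless of which declaration comes first. It detects the collision; it does not decide how to resolve it, which is the part Hoss argues should be left to user configuration.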
Re: SOLR-1131 - Multiple Fields per Field Type
What about, rather than conflating field types for creating multiple fields, using update processors to do this expansion instead?

Erik

On Nov 26, 2009, at 10:04 AM, Grant Ingersoll wrote:

> On Nov 25, 2009, at 8:24 PM, Chris Hostetter wrote:
>
> > I'm having a hard time wrapping my head around this entire concept ... i know part of my problem is that your example use case seems somewhat nonsensical...
> >
> > : As a simple proof of concept, imagine that I define a new FieldType
> > : called PlusMinusIntFieldType that extends IntField. This FieldType
> > : takes in an int value and outputs two Fields: one with the original
> > : value and one with the negative of the value.
> > ...
> > : OK, on the search side is where it gets tricky. The whole point of this
> > : exercise is that the details are hidden from the user in the generic
> > : case. Thus, a query of plusMinus:5 should automatically expand to
> > : (plusMinus__0:5 OR plusMinus__1:-5). Of course, an expert user should
> >
> > ...nothing could match plusMinus__0:5 that didn't also match plusMinus__1:-5, so i don't really understand what the point of using the field expansion for a use case like this would be ... and that's making it hard for me to try and understand how this sort of system could/should/would be used at query time.
>
> Kind of, if a user just inputs plusMinus:5, then sure, but they may also want to just search the negative portion. More importantly, though, they may have a QParser or some other component that can appropriately select one of the fields w/o the user knowing.
>
> > perhaps a more realistic example would be helpful? ...or even some different simple and contrived examples that demonstrate how this could be useful in a way that isn't possible with a single field.
>
> OK, a more concrete example is spatial. A user will want to index a point as a lat lon. So, they index: <field name="latLon">49, -79</field>. The implementation of how this gets indexed can be done in several ways.
>
> For starters, it can be represented as a single field using Geohash or even just as a string (even if that isn't useful for much). We don't need SOLR-1131 for that at all.
>
> Next, they may just want to represent it as two fields: one for the lat and one for the lon. Again, not super hard to do now, but it requires the user to set it up, whereas with a LatLonFieldType, this would be hidden from them.
>
> Finally, consider the cartesian tier case. In this case, a single lat lon point could be mapped to a whole slew of tiers, where each tier is like a zoom level on a map application (like Google Maps). Here, we could have a CartesianTierFieldType that takes in the lower and upper bounds of the tiers to represent, i.e. tier 4 through 17, and this would output 14 different fields. Local Solr currently handles this through dynamic fields and user level knowledge of the magic fields used.
>
> For this case, there are several different search patterns:
> 1. The user may know the tier they want to search at and thus input tier and a zoom level.
> 2. User invokes a QParser to build a bounding box (see https://issues.apache.org/jira/browse/SOLR-1568) and the Parser is responsible for creating a filter that chooses the most appropriate tier to search against. So, the user might just say: {!tier lat=X lon=Y dist=10} and it will pick the most appropriate tier, whereas putting in dist=50 would likely pick a different tier.
>
> Does that help? BTW, all of this is tracked via SOLR-773.
Re: SOLR-1131 - Multiple Fields per Field Type
On Nov 28, 2009, at 3:45 AM, Erik Hatcher wrote:

> What about, rather than conflating field types for creating multiple fields, using update processors to do this expansion instead?

How do you maintain the semantic information needed at search time? Are you still having the field type (or schema or something accessible by search) be aware of the change?
Re: SOLR-1131 - Multiple Fields per Field Type
Hi,

Aren't search semantics the responsibility of a Query Parser and the Queries themselves? Just as the semantics of boolean queries are handled by the standard Query parsers and BooleanQuery.

On Sat, Nov 28, 2009 at 3:17 PM, Grant Ingersoll gsing...@apache.org wrote:

> On Nov 28, 2009, at 3:45 AM, Erik Hatcher wrote:
>
> > What about, rather than conflating field types for creating multiple fields, using update processors to do this expansion instead?
>
> How do you maintain the semantic information needed at search time? Are you still having the field type (or schema or something accessible by search) be aware of the change?
Re: SOLR-1131 - Multiple Fields per Field Type
On Sat, Nov 28, 2009 at 9:41 AM, Chris Male gento...@gmail.com wrote:

> Aren't search semantics the responsibility of a Query Parser and the Queries themselves? Just as the semantics of boolean queries are handled by the standard Query parsers and BooleanQuery.

At a certain point, one needs polymorphic behavior to do the right thing (unless you hard-code all field type info into the query parser). This is already done via fieldType to control how range queries, function queries, sort fields, etc., are created.

We *could* encode all of the info for both lat/lon in a single field, but it would be more work since Lucene fieldcaches, numeric range queries, etc., don't support that. Practically, it seems easiest to allow a single fieldType to use more than one internal field.

-Yonik
http://www.lucidimagination.com
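The polymorphism argument above can be illustrated with a minimal sketch in which the field type, not the query parser, decides what a range query looks like. The classes `SketchFieldType`, `SketchNumericType`, and `SketchPointType` are hypothetical and return query strings rather than Lucene Query objects; the `__0`/`__1` sub-field names and the "lat,lon" endpoint syntax are assumptions borrowed from examples elsewhere in this thread.

```java
/** Hypothetical base class: each type decides how a range over its field is expressed. */
abstract class SketchFieldType {
    abstract String rangeQuery(String field, String lower, String upper);
}

/** Single-field numeric type: one ordinary range clause. */
class SketchNumericType extends SketchFieldType {
    @Override
    String rangeQuery(String field, String lower, String upper) {
        return field + ":[" + lower + " TO " + upper + "]";
    }
}

/** Two-field point type: a "lat,lon" range becomes one required clause per internal field. */
class SketchPointType extends SketchFieldType {
    @Override
    String rangeQuery(String field, String lower, String upper) {
        String[] lo = lower.split(",");
        String[] hi = upper.split(",");
        if (lo.length != 2 || hi.length != 2) {
            throw new IllegalArgumentException("expected lat,lon endpoints");
        }
        return "+" + field + "__0:[" + lo[0].trim() + " TO " + hi[0].trim() + "] "
             + "+" + field + "__1:[" + lo[1].trim() + " TO " + hi[1].trim() + "]";
    }
}
```

A parser holding a `SketchFieldType` reference calls `rangeQuery` the same way for both types, so it needs no hard-coded knowledge of which types use more than one internal field.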
Re: SOLR-1131 - Multiple Fields per Field Type
Hi,

There is some standardization of the syntax and semantics of range queries, function queries and sorting that exists outside of the field types themselves, though. For example, for range queries FieldType expects that there are just 2 values that define the range, I think. That's a requirement that is enforced by the Query Parser. By allowing each FieldType to have its own search semantics, you are going to have to let them do their own parsing too.

For Grant's example of a PlusMinus kind of field, it's possible to support it through term query like syntax so no custom parsing has to occur, but for other types of fields that have multiple fields, that might not be possible. In these situations is a custom Query Parser going to be necessary?

On Sat, Nov 28, 2009 at 4:35 PM, Yonik Seeley yo...@lucidimagination.com wrote:

> On Sat, Nov 28, 2009 at 9:41 AM, Chris Male gento...@gmail.com wrote:
>
> > Aren't search semantics the responsibility of a Query Parser and the Queries themselves? Just as the semantics of boolean queries are handled by the standard Query parsers and BooleanQuery.
>
> At a certain point, one needs polymorphic behavior to do the right thing (unless you hard-code all field type info into the query parser). This is already done via fieldType to control how range queries, function queries, sort fields, etc., are created.
>
> We *could* encode all of the info for both lat/lon in a single field, but it would be more work since Lucene fieldcaches, numeric range queries, etc., don't support that. Practically, it seems easiest to allow a single fieldType to use more than one internal field.
>
> -Yonik
> http://www.lucidimagination.com
Re: SOLR-1131 - Multiple Fields per Field Type
On Sat, Nov 28, 2009 at 10:51 AM, Chris Male gento...@gmail.com wrote:

> By allowing each FieldType to have its own search semantics

We're far enough removed from an actual feature that I'm not sure if we're disagreeing about anything concrete :-)

Going back to Grant's original question, I think it's just a matter of documentation for specific fieldTypes. A certain fieldType like geo1 could index the lat and lon as separate numeric fields. That would be a behavior specific to that fieldType that users would know about if they wanted to construct elementary queries on lat or lon only. That would not be supported in all geo types of course -- it's an implementation detail that may or may not be part of the interface... it depends on what makes sense for the particular fieldType.

I don't think it's useful to somehow programmatically access the list of fields that a fieldType could output.

-Yonik
http://www.lucidimagination.com
Re: SOLR-1131 - Multiple Fields per Field Type
: I don't think it's useful to somehow programmatically access the list
: of fields that a fieldType could output.

based on my understanding of the potential types of use cases we're talking about, i think i agree with you. It seems like the most crucial aspect is that a FieldType has a way of producing multiple o.a.l.document.Field instances (potentially with different field names) from a single String input at index time. this can be done with something like the API that Grant mentioned earlier.

For anything except trivial use cases, any code (like a query parser) attempting to deal with these fields is going to need to be very special purpose and have direct knowledge of the code in the FieldType. if a CartesianGeoSearchQParser is asked to parse store_location:89.3,45.4~5miles it can throw a parse exception if IndexSchema.getFieldType("store_location") isn't an instanceof CartesianGeoSearchFieldType -- assuming it is: it can cast it and call CartesianGeoSearchFieldType specific methods to find out everything it needs to know about what multitudes of field names that specific instance produced based on its configuration.

(side thought: we may want to add a getFieldTypesByClass method to the IndexSchema so QParsers and SearchComponents can get lists of fields matching special cases they want to know about -- but that's a secondary concern)

One thing that concerns me is potential field name collision -- where one of these new multifield producing FieldTypes might want to create a name that happens to collide with a field the user has already declared. Using double underscores kind of feels like a hack; what i keep wondering is if we can't leverage dynamicFields here.

if we require that these FieldTypes be declared using dynamicField declarations (they could error on init otherwise) then the wildcard nature of the name tells the FieldType where it's allowed to add things to the pattern to make unique field names in the index -- and they can still be used as true dynamic fields, as long as they always add to the field name given to them. something like...

<dynamicField name="location*" type="geo1" />

can be used to index a single "location" field (internal construction location_lat and location_lon) or it could be used to support a "location_start" and "location_end" field (using location_start_lat+location_start_lon and location_end_lat+location_end_lon)

-Hoss
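The naming rule proposed above (the wildcard marks where a FieldType may add text) can be sketched as follows. `DynamicPatternNaming` and its methods are hypothetical helpers, not Solr code, and the sketch assumes a pattern carries a single `*` at its start or its end.

```java
/** Hypothetical helper showing how a wildcard pattern constrains generated sub-field names. */
class DynamicPatternNaming {
    /** True if name matches a pattern with one leading or trailing '*', e.g. "location*" or "*_point". */
    static boolean matches(String pattern, String name) {
        if (pattern.endsWith("*")) {
            return name.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        if (pattern.startsWith("*")) {
            return name.endsWith(pattern.substring(1));
        }
        return pattern.equals(name);
    }

    /** Builds a sub-field name by adding text on the wildcard side, so the result still matches. */
    static String subFieldName(String pattern, String fieldName, String addition) {
        // Mirrors "they could error on init otherwise": a non-wildcard declaration is rejected.
        if (!pattern.startsWith("*") && !pattern.endsWith("*")) {
            throw new IllegalArgumentException(
                    "poly field must be declared with a dynamicField pattern: " + pattern);
        }
        if (!matches(pattern, fieldName)) {
            throw new IllegalArgumentException(fieldName + " does not match " + pattern);
        }
        // Trailing wildcard: append. Leading wildcard: prepend.
        return pattern.endsWith("*") ? fieldName + addition : addition + fieldName;
    }
}
```

With the pattern `location*`, the field `location` yields `location_lat` and `location_lon`, and the field `location_start` yields `location_start_lat` and `location_start_lon`, matching the names given in the message.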
Re: SOLR-1131 - Multiple Fields per Field Type
Hey Hoss,

On 11/28/09 4:37 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

> One thing that concerns me is potential field name collision -- where one of these new multifield producing FieldTypes might want to create a name that happens to collide with a field the user has already declared.

Since FieldTypes are provided an instance of o.a.solr.schema.IndexSchema, couldn't these special MultiFieldTypes just call #getFieldOrNull(fieldName) to find out if an internal field they want to name is available in the schema namespace?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: SOLR-1131 - Multiple Fields per Field Type
On Sat, Nov 28, 2009 at 7:37 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

> Using double underscores kind of feels like a hack; what i keep wondering is if we can't leverage dynamicFields here.

This is what the prototype patch I just put up on SOLR-1131 does. I've gone with option A for now (see cut'n'pasted comments from DualPointType), but it's up to the specific FieldType.

/**
 * Two possible ways to allow the user to specify the __lat and __lon fields:
 *
 * A: let them specify the complete suffix for the lat field and lon field. It
 *    is up to them to make sure those dynamic field types are defined in the schema.
 *    Advantages: no new mechanism needed to add fieldTypes during the initialization
 *    of another fieldType.
 *    Disadvantages: more clutter in the schema, lack of control over what fieldTypes
 *    are used, need to delegate absolutely everything through the subFieldType since
 *    we don't actually know what it is.
 *
 * B: Have the TriePointType create and insert the fieldtypes for the lat and lon
 *    fields itself.
 *    Advantages: less clutter in the schema, more control over the exact numeric field
 *    type.
 *    Disadvantages: dynamically adding new types not currently supported,
 *    less customizability
 */

-Yonik
http://www.lucidimagination.com
Re: SOLR-1131 - Multiple Fields per Field Type
On Nov 25, 2009, at 8:24 PM, Chris Hostetter wrote:

> I'm having a hard time wrapping my head around this entire concept ... i know part of my problem is that your example use case seems somewhat nonsensical...
>
> : As a simple proof of concept, imagine that I define a new FieldType
> : called PlusMinusIntFieldType that extends IntField. This FieldType
> : takes in an int value and outputs two Fields: one with the original
> : value and one with the negative of the value.
> ...
> : OK, on the search side is where it gets tricky. The whole point of this
> : exercise is that the details are hidden from the user in the generic
> : case. Thus, a query of plusMinus:5 should automatically expand to
> : (plusMinus__0:5 OR plusMinus__1:-5). Of course, an expert user should
>
> ...nothing could match plusMinus__0:5 that didn't also match plusMinus__1:-5, so i don't really understand what the point of using the field expansion for a use case like this would be ... and that's making it hard for me to try and understand how this sort of system could/should/would be used at query time.

Kind of, if a user just inputs plusMinus:5, then sure, but they may also want to just search the negative portion. More importantly, though, they may have a QParser or some other component that can appropriately select one of the fields w/o the user knowing.

> perhaps a more realistic example would be helpful? ...or even some different simple and contrived examples that demonstrate how this could be useful in a way that isn't possible with a single field.

OK, a more concrete example is spatial. A user will want to index a point as a lat lon. So, they index: <field name="latLon">49, -79</field>. The implementation of how this gets indexed can be done in several ways.

For starters, it can be represented as a single field using Geohash or even just as a string (even if that isn't useful for much). We don't need SOLR-1131 for that at all.

Next, they may just want to represent it as two fields: one for the lat and one for the lon. Again, not super hard to do now, but it requires the user to set it up, whereas with a LatLonFieldType, this would be hidden from them.

Finally, consider the cartesian tier case. In this case, a single lat lon point could be mapped to a whole slew of tiers, where each tier is like a zoom level on a map application (like Google Maps). Here, we could have a CartesianTierFieldType that takes in the lower and upper bounds of the tiers to represent, i.e. tier 4 through 17, and this would output 14 different fields. Local Solr currently handles this through dynamic fields and user level knowledge of the magic fields used.

For this case, there are several different search patterns:
1. The user may know the tier they want to search at and thus input tier and a zoom level.
2. User invokes a QParser to build a bounding box (see https://issues.apache.org/jira/browse/SOLR-1568) and the Parser is responsible for creating a filter that chooses the most appropriate tier to search against. So, the user might just say: {!tier lat=X lon=Y dist=10} and it will pick the most appropriate tier, whereas putting in dist=50 would likely pick a different tier.

Does that help? BTW, all of this is tracked via SOLR-773.
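The cartesian tier case can be made concrete with a small sketch of the name generation only. `TierFieldNames` and the `__tier_` naming scheme are hypothetical (the thread does not specify how Local Solr or a CartesianTierFieldType would name the fields); the point is that one configured field type with inclusive lower and upper tier bounds fans a single input out into one field per tier, so tiers 4 through 17 give 14 names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: one internal field name per tier in an inclusive range. */
class TierFieldNames {
    static List<String> forRange(String fieldName, int lowerTier, int upperTier) {
        if (lowerTier > upperTier) {
            throw new IllegalArgumentException("lowerTier must not exceed upperTier");
        }
        List<String> names = new ArrayList<>();
        // Both bounds are included, so the count is upperTier - lowerTier + 1.
        for (int tier = lowerTier; tier <= upperTier; tier++) {
            names.add(fieldName + "__tier_" + tier);
        }
        return names;
    }
}
```

A query-side component such as the `{!tier ...}` parser mentioned above would then choose one of these names based on the requested distance; that selection logic is not sketched here because the thread does not describe it.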
Re: SOLR-1131 - Multiple Fields per Field Type
I'm having a hard time wrapping my head around this entire concept ... i know part of my problem is that your example use case seems somewhat nonsensical...

: As a simple proof of concept, imagine that I define a new FieldType
: called PlusMinusIntFieldType that extends IntField. This FieldType
: takes in an int value and outputs two Fields: one with the original
: value and one with the negative of the value.
...
: OK, on the search side is where it gets tricky. The whole point of this
: exercise is that the details are hidden from the user in the generic
: case. Thus, a query of plusMinus:5 should automatically expand to
: (plusMinus__0:5 OR plusMinus__1:-5). Of course, an expert user should

...nothing could match plusMinus__0:5 that didn't also match plusMinus__1:-5, so i don't really understand what the point of using the field expansion for a use case like this would be ... and that's making it hard for me to try and understand how this sort of system could/should/would be used at query time.

perhaps a more realistic example would be helpful? ...or even some different simple and contrived examples that demonstrate how this could be useful in a way that isn't possible with a single field.

-Hoss
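For reference, the PlusMinus proof of concept quoted above can be sketched in a few lines. `PlusMinusSketch` is a hypothetical stand-in, not the proposed PlusMinusIntFieldType: it uses a plain map instead of Lucene Field objects and a string instead of a Query, and it takes the `__0`/`__1` suffixes from the quoted expansion.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of one input value fanning out to two internal fields. */
class PlusMinusSketch {
    /** Index side: the original value goes to name__0, its negation to name__1. */
    static Map<String, Integer> createFields(String fieldName, int value) {
        Map<String, Integer> fields = new LinkedHashMap<>();
        fields.put(fieldName + "__0", value);
        fields.put(fieldName + "__1", -value);
        return fields;
    }

    /** Query side: a query on the user-visible field expands to both internal fields. */
    static String expandQuery(String fieldName, int value) {
        return "(" + fieldName + "__0:" + value + " OR " + fieldName + "__1:" + (-value) + ")";
    }

    public static void main(String[] args) {
        System.out.println(createFields("plusMinus", 5));  // {plusMinus__0=5, plusMinus__1=-5}
        System.out.println(expandQuery("plusMinus", 5));   // (plusMinus__0:5 OR plusMinus__1:-5)
    }
}
```

The sketch also makes the objection visible: both clauses of the expanded query are satisfied by exactly the same documents, so the expansion adds nothing over a single field in this particular example.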