Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-10 Thread Grant Ingersoll

On Dec 10, 2009, at 1:01 AM, Mattmann, Chris A (388J) wrote:

 Hi Yonik et al.,
 
 I'd like to add:
 
 Option C: Sub fields are specified as an attribute on the fieldType tag
 
 // needed to essentially define the point type
 fieldType name=latlon class=GeoPoint subFieldSuffix=_latlon ../
 
 // uses of the latlon type
 field name=home type=latlon indexed=true stored=false/
 
 // subFieldSuffix is appended to the subFields indexed and thus those would
 be:
 
 home_latlon_0
 home_latlon_1
 
 I like elements of Option B that you present below, however it seems to be
 mixing concerns. Type inheritance (aka your subFieldType attribute) seems
 to be orthogonal to poly fields -- a good idea, but another issue IMHO.
 


I'm not sure this works, as you need to specify the type of the subfield, which 
is what Option B does.  I don't think inheritance is what is going on here; it 
is more like delegation, and that isn't necessarily needed for all 
implementations, but just happens to be how it is done for the example in 
question.  People implementing FieldTypes could certainly just encode things 
the way they want using their own internal mechanism (or the existing ones, but 
w/o configuration).


 Cheers,
 Chris
 
 On 12/9/09 1:12 PM, Yonik Seeley yo...@lucidimagination.com wrote:
 
 Proposal for handling points using only the field lookup mechanisms
 currently in place in IndexSchema:
 
 Option A: dynamic fields used for subfields, those dynamic fields need
 to be explicitly defined in the XML
 
 // needed to essentially define the point type
 fieldType name=latlon class=TrieDoubleField precisionStep=8/
 fieldType name=point subFieldSuffix=_latlon .../
 dynamicField name=*_latlon type=latlon indexed=true stored=false/
 
 // uses of the point type
 field name=home type=point/
 dynamicField name=*_point type=point/
 
 // subFieldSuffix is appended to the subFields indexed and thus those would 
 be
 home__0_latlon
 home__1_latlon
 
 // And the indexed fields for dynamic field work_point would be
 work_point__0_latlon
 work_point__1_latlon
 
 // NOTE: this scheme works fine for subFields with different fieldTypes
 
 Option B: dynamic fields used for subfields, dynamic fields inserted
 into schema automatically
 
 // needed to essentially define the point type
 fieldType name=latlon class=TrieDoubleField precisionStep=8/
 fieldType name=point subFieldType=latlon/
 
 // uses of the point type
 field name=home type=point/
 dynamicField name=*_point type=point/
 
 // A dynamic field is inserted into the schema by the point class of
 the form __subFieldTypeName by default.
 // This could be changed via an optional subFieldSuffix param on the
 point fieldType.  double underscore used
 // to minimize collisions with user-defined dynamic fields.
 home_0__latlon
 home_1__latlon
 
 // And the indexed fields for dynamic field work_point would be
 work_point__0__latlon
 work_point__1__latlon
 
 // NOTE: this scheme works fine for subFields with different fieldTypes
 
 
 -Yonik
 http://www.lucidimagination.com
 
 
 
 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.mattm...@jpl.nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department University of
 Southern California, Los Angeles, CA 90089 USA
 ++
 
 

--
Grant Ingersoll
http://www.lucidimagination.com/




Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-10 Thread Mattmann, Chris A (388J)
Hi Grant,

On 12/10/09 3:16 AM, Grant Ingersoll gsing...@apache.org wrote:

 
 I'm not sure this works, as you need to specify the type of the subfield,
 which is what Option B does.  I don't think inheritance is what is going
 on here; it is more like delegation, and that isn't necessarily needed for all
 implementations, but just happens to be how it is done for the example in
 question.  People implementing FieldTypes could certainly just encode things
 the way they want using their own internal mechanism (or the existing ones,
 but w/o configuration).

Well, if that doesn't work then option A doesn't work either, because it
doesn't specify the subFieldType. I'm also not sure that option B works,
because what if there are multiple subFieldTypes? For instance, what if I
want to store one of the polyField's subTypes as a tint, and the other as a
regular int? How would that be specified?

Either way you need some combination of A + B, or (my preference) B + C.

Cheers,
Chris






Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-10 Thread Mattmann, Chris A (388J)
I'll try and hold off, but also work on a patch for option (B+)C :)


On 12/10/09 7:37 AM, Grant Ingersoll gsing...@apache.org wrote:

 I have Option B implemented at this point, minus a few tests passing.  I'll
 put up a patch as soon as I get it working.






Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-10 Thread Chris Hostetter

: My current thought on #1 is that we probably don't want to change the
: internal lookup mechanism used by IndexSchema unless we gain
: significant power by doing so.  I'm not sure I currently see it.
: 
: My thoughts on #2 are more on a case-by-case basis.  For the simple
: case of a point class with two fields indexed separately, referencing
: a suffix that should be defined as a dynamic field vs referencing a
: type seem pretty close.  The latter, while perhaps slightly simpler
: for the user, seems to introduce a lot of hidden complexities.  I'm
: less concerned than Hoss is about name clashes, but much more
: concerned about those complexities.

...this was ultimately my point.  Name clashes were a concrete example i 
was trying to use to illustrate the (broader) complexity that seems 
involved in trying to develop a generalized PolyField system that is 100% 
transparent to both end users and schema creators.

My personal view is that none of this stuff should be transparent to the 
schema creator: we shouldn't try to hide things from them -- having 
defaults so they don't have to worry about some details is fine, but they 
should have control over those details if they want.

That said: although i have an opinion, it's not a particularly strong 
opinion.  Since i appear to be the minority here (and feel reasonably 
confident that i've explained my concerns well enough that you guys 
understand my point even if you don't agree with it) i'll just say i'm +1 
at leaving the schema creator entirely in control of the names created by 
PolyFields; and -0 at having fields created entirely transparently; ... 
and leave it at that.


-Hoss



Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Chris Hostetter

:  I'm not really understanding the value of an approach like that.  for
:  starters, what Lucene field names would ultimately be created in those
:  examples?  
: 
: The first field would be named location__location.
: The second field would be named location_home_location_home.
: The third field would be named location_work_location_work.

I'm not understanding your answer -- the whole point of this hypothetical 
PolyField example is that for each field the user knows about, it's 
building up a lat and a lon field under the covers -- my question is 
what real, under the covers, names do you suggest be created from the 
type of configuration you suggested?

:   field name=other_location type=latlon/
:   dynamicField name=*_dynamic_location type=latlon/
:  
:  ...then what field names would be created under the covers?
:  
: 
: In general, it would be FieldType#getPattern().stripOffEndRegexStarStuff() +
: Field#getName(). 

...that still feels a lot more obscure than letting the field 
declaration control things -- because as a schema creator i not only have 
to check that the field name i want to add doesn't conflict with an 
existing field name (or dynamicField name glob) but i also have to check 
that it doesn't conflict with a pattern in a PolyField fieldType 
declaration.

why make people check two things when using dynamicFields is something 
they already understand?

: Well if this feels wrong to you then I think the schema.xml file that ships
: with SOLR should also feel wrong as well because it uses the exact same
: pattern for defining field type variations. That is, differences between
: FieldType representations for ints and tints are not stored as variations on
: the SchemaField definition itself but they are stored as variation on the
: FieldTypes (e.g., a different precisionStep in the case of int [0] versus
: that of tint [8]). Based on what you are proposing, why isn't precisionStep
: an attribute on field, rather than fieldType in those examples?

There's a huge difference there -- nothing in a <fieldType/> declaration 
right now has any influence whatsoever on the ultimate name of the 
fields used -- <field/> declarations can inherit a lot of stuff from 
<fieldType/> but we've never let the <fieldType/> influence the name.

In an existing solr schema, i can have a list of <fieldType/>s 
and i can have a list of <field/>s that reference those <fieldType/>s by 
name -- and i can tweak the settings on those things largely independently 
(as long as i reindex) ... but i never have to worry that tweaking the 
settings of a <fieldType/> might completely break an index by causing the 
underlying name of some <field name="a" .../> to suddenly collide with 
some other <field name="a__a" .../>

: Possibly. It's also a lot less traceable. It's implicit versus explicit,
: which I'm not sure leads to simplicity in the end.

I feel the exact opposite actually.  Saying "these PolyField types will 
create multiple fields under the covers, and you have to use them 
as <dynamicFields/> to control what names they use" seems a lot more 
explicit and easily traceable than "these PolyField types will create 
multiple fields under the covers, so you specify a pattern when you 
declare them, and then when you declare fields or dynamicFields that use 
them the following rule(s) will be applied to generate the underlying 
field names, so remember this rule when naming other fields to prevent 
conflicts."



-Hoss



Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Chris Hostetter

: fieldType name=point type=solr.PointType dimension=2 
subFieldType=double/
: field name=home type=point indexed=true stored=true/
...
: And a new document of:
: doc
: field name=point39.0 -79.434/field
: /doc
: 
: There are three fields created:
: home --  Contains the stored value
: home___0 - Contains 39.0 indexed as a double (as in the double FieldType, 
not just a double precision)
: home___1 - Contains -79.434 as a double 
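
To make the decomposition quoted above concrete, here is a minimal Python sketch. It is an illustration only, not code from the SOLR-1131 patch: expand_point and its parameters are hypothetical names, and the triple-underscore-plus-index naming is simply taken from the example.

```python
def expand_point(field_name, value, dimension=2, separator="___"):
    """Split a space-separated point into the stored value plus one
    indexed sub-field per dimension (hypothetical helper, not Solr API)."""
    parts = value.split()
    if len(parts) != dimension:
        raise ValueError(f"expected {dimension} values, got {len(parts)}")
    # The original string is kept under the user-visible field name.
    fields = {field_name: value}
    # Each coordinate goes into its own generated sub-field.
    for i, part in enumerate(parts):
        fields[f"{field_name}{separator}{i}"] = float(part)
    return fields


fields = expand_point("home", "39.0 -79.434")
```

The generated keys (home___0, home___1) are exactly the names a schema author could clash with by declaring a field of the same name.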

Grant: All of this i understand -- the back and forth Mattmann and I have 
been having is specifically about the idea that the __0 and __1 should be 
more transparent when declaring the schema.  As it stands right now, if i 
add this to my schema...

field name=home___0 type=int indexed=true stored=true/

...i can really break things.  The odds of that happening are probably 
low, but it would still be very easy to make this type more transparent to 
schema creators by requiring that PolyFields be declared as dynamicFields. 
so your previous example would become...

: fieldType name=point type=solr.PointType dimension=2 
subFieldType=double/
: dynamicField name=home* type=point indexed=true stored=true/

...now if i'm stupid enough to add field name=home___0/ it's my own 
damn fault (just like it is right now w/o having PolyFields in Solr)

:  letting dynamicField/ drive everything just seems a *lot* simpler ... 
:  both as far as implementation, and as far as maintaining the schema.
: 
: I don't agree.  It requires more configuration and more knowledge by the end 
user and doesn't hide the details.

 1) My example requires 8 more characters than yours.
 2) The end user doesn't need to know it's a dynamic field, they still 
just deal with a field named home
 3) my whole point is that we shouldn't be hiding these details from the 
person editing the schema.xml




-Hoss



Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Chris Hostetter

: That's not how the Cartesian Field stuff works, but I think I see what 
: you are getting at and I would say I'm going to explicitly punt on that 
: right now.  Ultimately, I think when such a case comes up, the FieldType 
: needs to be configured to be able to determine this information.

I'm fine punting on it -- i don't understand half this stuff anyway -- i 
just wanted to raise the issue in case someone else said "Ack! ... yes that 
is a big oversight in the API that will cause problems with X, Y, Z."



-Hoss



Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Grant Ingersoll

On Dec 9, 2009, at 2:04 PM, Chris Hostetter wrote:

 
 : fieldType name=point type=solr.PointType dimension=2 
 subFieldType=double/
 : field name=home type=point indexed=true stored=true/
   ...
 : And a new document of:
 : doc
 : field name=point39.0 -79.434/field
 : /doc
 : 
 : There are three fields created:
 : home --  Contains the stored value
 : home___0 - Contains 39.0 indexed as a double (as in the double FieldType, 
 not just a double precision)
 : home___1 - Contains -79.434 as a double 
 
 Grant: All of this i understand -- the back and forth Mattmann and I have 
 been having is specifically about the idea that the __0 and __1 should be 
 more transparent when declaring the schema.  As it stands right now, if i 
 add this to my schema...
 
 field name=home___0 type=int indexed=true stored=true/
 
 ...i can really break things.  The odds of that happening are probably 
 low, but it would still be very easy to make this type more transparent to 
 schema creators by requiring that PolyFields be declared as dynamicFields. 
 so your previous example would become...
 
 : fieldType name=point type=solr.PointType dimension=2 
 subFieldType=double/
 : dynamicField name=home* type=point indexed=true stored=true/
 
 ...now if i'm stupid enough to add field name=home___0/ it's my own 
 damn fault (just like it is right now w/o having PolyFields in Solr)
 
 :  letting dynamicField/ drive everything just seems a *lot* simpler ... 
 :  both as far as implementation, and as far as maintaining the schema.
 : 
 : I don't agree.  It requires more configuration and more knowledge by the 
 end user and doesn't hide the details.
 
 1) My example requires 8 more characters than yours.

It's not about the characters, obviously, it's about the mindset of the person 
doing the modeling, hence...

 2) The end user doesn't need to know it's a dynamic field, they still 
just deal with a field named home
 3) my whole point is that we shouldn't be hiding these details from the 
person editing the schema.xml


I'm not sure I agree.  I think people would expect to use a new Field Type in 
exactly the same ways they use existing Field Types, namely anywhere they want 
(dynamic or not).  We could easily validate the schema at startup time to see 
whether they have done the scenario you describe above and throw an exception.
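
A rough sketch of what such a startup check could look like, written in Python for brevity rather than as Solr code. It assumes sub-fields are named <field>___<index> as in the example above; find_subfield_collisions and validate_schema are hypothetical names, not existing Solr APIs.

```python
def find_subfield_collisions(declared_fields, poly_fields,
                             dimension=2, separator="___"):
    """Return explicitly declared names that clash with the sub-field
    names a poly field would generate (hypothetical helper)."""
    generated = {
        f"{name}{separator}{i}"
        for name in poly_fields
        for i in range(dimension)
    }
    return sorted(generated & set(declared_fields))


def validate_schema(declared_fields, poly_fields):
    """Fail fast, as the schema is loaded, if any clash is found."""
    clashes = find_subfield_collisions(declared_fields, poly_fields)
    if clashes:
        raise ValueError(
            f"field names collide with poly-field sub-fields: {clashes}")
```

With this in place, declaring both a point field named home and an explicit field named home___0 would be rejected when the schema is loaded instead of silently corrupting the index.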



Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Yonik Seeley
I haven't followed this whole thread... but I wanted to point out that
it probably intersects with the review of Grant's latest patch that I
did here: https://issues.apache.org/jira/browse/SOLR-1131

I did want to cut'n'paste something from that post:
: I do want to separate these two issues though:
: 1) field lookup mechanism (currently just exact name in schema
followed by a dynamic field check)
: 2) if and when fields or field types should be explicitly defined in
the schema vs being created by the polyField

My current thought on #1 is that we probably don't want to change the
internal lookup mechanism used by IndexSchema unless we gain
significant power by doing so.  I'm not sure I currently see it.

My thoughts on #2 are more on a case-by-case basis.  For the simple
case of a point class with two fields indexed separately, referencing
a suffix that should be defined as a dynamic field vs referencing a
type seem pretty close.  The latter, while perhaps slightly simpler
for the user, seems to introduce a lot of hidden complexities.  I'm
less concerned than Hoss is about name clashes, but much more
concerned about those complexities.
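
For reference, the lookup described in #1 (exact name first, then a dynamic field check) can be sketched roughly as follows. This is a simplified Python illustration, not the IndexSchema code: lookup_field_type is a hypothetical name, patterns are assumed to carry a single leading or trailing *, and taking the first matching pattern in list order is an assumption.

```python
def lookup_field_type(name, fields, dynamic_fields):
    """Resolve a field name to a type name: explicit fields win,
    then the first matching dynamic-field pattern (sketch only)."""
    if name in fields:
        return fields[name]
    for pattern, field_type in dynamic_fields:
        # "*_latlon" style: match on the suffix after the star.
        if pattern.startswith("*") and name.endswith(pattern[1:]):
            return field_type
        # "work_*" style: match on the prefix before the star.
        if pattern.endswith("*") and name.startswith(pattern[:-1]):
            return field_type
    return None


fields = {"home": "point"}
dynamic_fields = [("*_latlon", "latlon"), ("*_point", "point")]
```

Under this scheme a generated sub-field only resolves if some dynamic field pattern covers it, which is why the suffix-based proposals need a matching dynamicField declared or registered.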

-Yonik
http://www.lucidimagination.com


Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Mattmann, Chris A (388J)
Hi All,

 
 : fieldType name=point type=solr.PointType dimension=2
 subFieldType=double/
 : field name=home type=point indexed=true stored=true/
   ...
 : And a new document of:
 : doc
 : field name=point39.0 -79.434/field
 : /doc
 :
 : There are three fields created:
 : home --  Contains the stored value
 : home___0 - Contains 39.0 indexed as a double (as in the double FieldType,
 not just a double precision)
 : home___1 - Contains -79.434 as a double
 
 Grant: All of this i understand -- the back and forth Mattmann and I have
 been having is specifically about the idea that the __0 and __1 should be
 more transparent when declaring the schema.  As it stands right now, if i
 add this to my schema...
 
 field name=home___0 type=int indexed=true stored=true/
 
 ...i can really break things.  The odds of that happening are probably
 low, but it would still be very easy to make this type more transparent to
 schema creators by requiring that PolyFields be declared as dynamicFields.
 so your previous example would become...
 
 : fieldType name=point type=solr.PointType dimension=2
 subFieldType=double/
 : dynamicField name=home* type=point indexed=true stored=true/
 
 ...now if i'm stupid enough to add field name=home___0/ it's my own
 damn fault (just like it is right now w/o having PolyFields in Solr)
 
 :  letting dynamicField/ drive everything just seems a *lot* simpler ...
 :  both as far as implementation, and as far as maintaining the schema.
 :
 : I don't agree.  It requires more configuration and more knowledge by the
 end user and doesn't hide the details.
 
 1) My example requires 8 more characters than yours.
 
 It's not about the characters, obviously, it's about the mindset of the person
 doing the modeling, hence...

+1.

 
 2) The end user doesn't need to know it's a dynamic field, they still
just deal with a field named home
 3) my whole point is that we shouldn't be hiding these details from the
person editing the schema.xml
 
 
 I'm not sure I agree.  I think people would expect to use a new Field Type in
 exactly the same ways they use existing Field Types, namely anywhere they want
 (dynamic or not).  We could easily validate the schema at startup time to see
 whether they have done the scenario you describe above and throw an exception.
 

+1 to that, as well. I had mentioned in an earlier thread about using the
APIs and code to perform such a check.

Cheers,
Chris





Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Yonik Seeley
On Wed, Dec 9, 2009 at 2:41 PM, Yonik Seeley yo...@lucidimagination.com wrote:
 So... the question is, do we have a concrete alternative to this that
 is well fleshed out?

I do, I do... just a little variant that is geo specific and hence
results in nicer names :-)

<fieldType name="point" latSuffix="_lat" lonSuffix="_lon"/>
<field name="home" type="point"/>
<dynamicField name="*_lat" type="tdouble" indexed="true" stored="false"/>
<dynamicField name="*_lon" type="tdouble" indexed="true" stored="false"/>

<dynamicField name="*_point" type="point"/>

home_lat
home_lon

work_point_lat
work_point_lon

Note: if you want the double or triple underscore to help prevent
collisions... then you could use latSuffix=___lat and define the
dynamic fields that way.
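
As a quick illustration of the names this variant would produce (a Python sketch only; geo_subfields is a hypothetical helper, and latSuffix/lonSuffix are the attributes proposed here, not an existing Solr feature):

```python
def geo_subfields(field_name, lat_suffix="_lat", lon_suffix="_lon"):
    """Return the (latitude, longitude) sub-field names for a point field."""
    return field_name + lat_suffix, field_name + lon_suffix


home = geo_subfields("home")
work = geo_subfields("work_point")
# Triple-underscore suffixes to reduce the chance of collisions.
safer = geo_subfields("home", lat_suffix="___lat", lon_suffix="___lon")
```

Whatever suffixes are chosen, the matching dynamic fields (for example *_lat and *_lon) still have to exist so that the generated names resolve to a type.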

-Yonik
http://www.lucidimagination.com



On Wed, Dec 9, 2009 at 2:41 PM, Yonik Seeley yo...@lucidimagination.com wrote:
 Here's an example of how everything could work with dynamic fields
 (apologies if it overlaps with examples already given by others in
 this thread) :

 fieldType name=point fieldSuffix=_latlon .../  // the subFields
 for the points end in _latlon
 field name=home type=point/
 dynamicField name=*_latlon type=tdouble indexed=true stored=false/

 // And we also want to allow point dynamic fields
 dynamicField name=*_point type=point/

 // Note: Grant made point more generic than geo, so it's 0 and 1
 instead of lat and lon
 // OK, so now the indexed fields for home would be
 home__0_latlon
 home__1_latlon

 // And the indexed fields for dynamic field work_point would be
 work_point__0_latlon
 work_point__1_latlon

 Not the prettiest names... but I think everything is well defined (how
 it would work with subFields of differing types.. have another param
 specifying a different suffix, how it works with dynamic fields, etc).



 -Yonik
 http://www.lucidimagination.com



Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Grant Ingersoll

On Dec 9, 2009, at 2:46 PM, Yonik Seeley wrote:

 On Wed, Dec 9, 2009 at 2:41 PM, Yonik Seeley yo...@lucidimagination.com 
 wrote:
 So... the question is, do we have a concrete alternative to this that
 is well fleshed out?
 
 I do, I do... just a little variant that is geo specific and hence
 results in nicer names :-)
 
 fieldType name=point latSuffix=_lat lonSuffix=_lon/
 field name=home type=point/
 dynamicField name=*_lat type=tdouble indexed=true stored=false/
 dynamicField name=*_lon type=tdouble indexed=true stored=false/
 
 dynamicField name=*_point type=point/
 
 home_lat
 home_lon
 
 work_point_lat
 work_point_lon
 
 Note: if you want the double or triple underscore to help prevent
 collisions... then you could use latSuffix=___lat and define the
 dynamic fields that way.



Additionally, how do you deal w/ a point in a 3D (or n-D) space?

I just don't see why a user shouldn't be able to use the FieldType just like 
any other FieldType, dynamic or not.  I think it is easy enough to detect name 
collisions and you still get all the flexibility of dynamic fields.

So, for example, say I was modeling a user and their employment history.  Thus, 
I have a single home address plus multiple work addresses.  One way of doing 
this would be:

<field name="home" type="point"/>
<dynamicField name="work_*" type="point"/>

And that should all just work.  The user would only ever deal w/ home or 
work_*, and would not have to deal w/ home___0 or whatever unless they really 
truly wanted to, and even then I am not sure it is needed.

How would you do this with what is proposed above?  Seems like you'd have a 
whole proliferation of fields.

Also, I don't see why a FieldType should have a dep. on a Field.  Having a 
dependency on another FieldType seems reasonable, but I'm not sure about on a 
Field.

Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Yonik Seeley
On Wed, Dec 9, 2009 at 3:21 PM, Grant Ingersoll gsing...@apache.org wrote:
 Additionally, how do you deal w/ a point in a 3D (or n-D) space?

I guess you would go back to the way you did it (0,1,etc).  This was
really just a naming variation, not really a different approach.

 I just don't see why a user shouldn't be able to use the FieldType just like 
 any other FieldType, dynamic or not.  I think it is easy enough to detect 
 name collisions and you still get all the flexibility of dynamic fields.

 So, for example, say I was modeling a user and their employment history.  
 Thus, I have a single home address plus multiple work addresses.  One way of 
 doing this would be:

 field name=home type=point/
 dynamicField name=work_* type=point/

 And that should all just work.

But it isn't that simple: you needed to define the point type, and
that point type needed to reference/define another type.
In the dynamicField proposal, you need to define a _latlon dynamic
field once.  It's also a separate decision from the lookup mechanism
(dynamic field based, or add a new poly-field mechanism) - the point
field type could choose to dynamically register *_latlon if it isn't
already registered.

[...]
 How would you do this with what is proposed above?  Seems like you'd have a 
 whole proliferation of fields.

I thought I defined it well... hmmm.
I'll take another stab, outlining using dynamic fields in both
scenarios (explicitly defined dynamic fields, and automatically
defined as part of the creation of the point class).  I think we
really do need to get concrete about our options at this point.

-Yonik


Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Grant Ingersoll

On Dec 9, 2009, at 3:47 PM, Yonik Seeley wrote:
 
 I thought I defined it well... hmmm.
 I'll take another stab, outlining using dynamic fields in both
 scenarios (explicitly defined dynamic fields, and automatically
 defined as part of the creation of the point class).  I think we
 really do need to get concrete about our options at this point.

Agreed, code would be good.


Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Yonik Seeley
On Wed, Dec 9, 2009 at 3:49 PM, Grant Ingersoll gsing...@apache.org wrote:

 On Dec 9, 2009, at 3:47 PM, Yonik Seeley wrote:

 I thought I defined it well... hmmm.
 I'll take another stab, outlining using dynamic fields in both
 scenarios (explicitly defined dynamic fields, and automatically
 defined as part of the creation of the point class).  I think we
 really do need to get concrete about our options at this point.

 Agreed, code would be good.

I had code (untested) just using dynamic fields... you changed it :-P
But I meant actual fieldType and field definitions, and what fields
get indexed as a result, and how type lookups on those fields happen.

-Yonik
http://www.lucidimagination.com


Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Grant Ingersoll

On Dec 9, 2009, at 3:52 PM, Yonik Seeley wrote:

 On Wed, Dec 9, 2009 at 3:49 PM, Grant Ingersoll gsing...@apache.org wrote:
 
 On Dec 9, 2009, at 3:47 PM, Yonik Seeley wrote:
 
 I thought I defined it well... hmmm.
 I'll take another stab, outlining using dynamic fields in both
 scenarios (explicitly defined dynamic fields, and automatically
 defined as part of the creation of the point class).  I think we
 really do need to get concrete about our options at this point.
 
 Agreed, code would be good.
 
 I had code (untested) just using dynamic fields... you changed it :-P
 But I meant actual fieldType and field definitions, and what fields
 get indexed as a result, and how type lookups on those fields happen.
 

Fair enough!




Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Yonik Seeley
Proposal for handling points using only the field lookup mechanisms
currently in place in IndexSchema:

Option A: dynamic fields used for subfields, those dynamic fields need
to be explicitly defined in the XML

// needed to essentially define the point type
<fieldType name="latlon" class="TrieDoubleField" precisionStep="8"/>
<fieldType name="point" subFieldSuffix="_latlon" .../>
<dynamicField name="*_latlon" type="latlon" indexed="true" stored="false"/>

// uses of the point type
<field name="home" type="point"/>
<dynamicField name="*_point" type="point"/>

// subFieldSuffix is appended to the subFields indexed and thus those would be
home__0_latlon
home__1_latlon

// And the indexed fields for dynamic field work_point would be
work_point__0_latlon
work_point__1_latlon

// NOTE: this scheme works fine for subFields with different fieldTypes

Option B: dynamic fields used for subfields, dynamic fields inserted
into schema automatically

// needed to essentially define the point type
<fieldType name="latlon" class="TrieDoubleField" precisionStep="8"/>
<fieldType name="point" subFieldType="latlon"/>

// uses of the point type
<field name="home" type="point"/>
<dynamicField name="*_point" type="point"/>

// A dynamic field is inserted into the schema by the point class of
// the form __subFieldTypeName by default.
// This could be changed via an optional subFieldSuffix param on the
// point fieldType.  double underscore used
// to minimize collisions with user-defined dynamic fields.
home_0__latlon
home_1__latlon

// And the indexed fields for dynamic field work_point would be
work_point__0__latlon
work_point__1__latlon

// NOTE: this scheme works fine for subFields with different fieldTypes
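
The two naming schemes can be compared side by side with a small Python sketch. This is an illustration under stated assumptions, not code from any patch: option_a_names and option_b_names are hypothetical helpers, the index is assumed to be joined with a double underscore in both options, and the Option B examples above are not fully consistent on that point (home_0__latlon versus work_point__0__latlon), so the sketch follows the double-underscore form.

```python
def option_a_names(field_name, sub_field_suffix, dimension=2):
    """Option A: <field>__<index><subFieldSuffix>; the matching
    dynamic field (e.g. *_latlon) is declared by hand in the schema."""
    return [f"{field_name}__{i}{sub_field_suffix}" for i in range(dimension)]


def option_b_names(field_name, sub_field_type,
                   sub_field_suffix=None, dimension=2):
    """Option B: the suffix defaults to "__" + subFieldType name and the
    matching dynamic field is registered by the point type itself."""
    suffix = (sub_field_suffix if sub_field_suffix is not None
              else f"__{sub_field_type}")
    return [f"{field_name}__{i}{suffix}" for i in range(dimension)]


a_home = option_a_names("home", "_latlon")
a_work = option_a_names("work_point", "_latlon")
b_work = option_b_names("work_point", "latlon")
```

The only functional difference is who supplies the suffix and the dynamic field: the schema author in Option A, the point type in Option B.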


-Yonik
http://www.lucidimagination.com


Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Grant Ingersoll
OK, I'm fine w/ taking this type of approach, as opposed to the lookup 
mechanism I have.  Of the two laid out below, there are pros and cons to both, 
as I see it. 

I'm inclined towards Option B.  This keeps it hidden from the user, but 
doesn't require extra work for Solr.   Let me code it up.


On Dec 9, 2009, at 4:12 PM, Yonik Seeley wrote:

 Proposal for handling points using only the field lookup mechanisms
 currently in place in IndexSchema:
 
 Option A: dynamic fields used for subfields, those dynamic fields need
 to be explicitly defined in the XML
 
 // needed to essentially define the point type
 fieldType name=latlon class=TrieDoubleField precisionStep=8/
 fieldType name=point subFieldSuffix=_latlon .../
 dynamicField name=*_latlon type=latlon indexed=true stored=false/
 
 // uses of the point type
 field name=home type=point/
 dynamicField name=*_point type=point/
 
 // subFieldSuffix is appended to the subFields indexed and thus those would be
 home__0_latlon
 home__1_latlon
 
 // And the indexed fields for dynamic field work_point would be
 work_point__0_latlon
 work_point__1_latlon
 
 // NOTE: this scheme works fine for subFields with different fieldTypes
 
 Option B: dynamic fields used for subfields, dynamic fields inserted
 into schema automatically
 
 // needed to essentially define the point type
 fieldType name=latlon class=TrieDoubleField precisionStep=8/
 fieldType name=point subFieldType=latlon/
 
 // uses of the point type
 field name=home type=point/
 dynamicField name=*_point type=point/
 
 // A dynamic field is inserted into the schema by the point class of
 the form __subFieldTypeName by default.
 // This could be changed via an optional subFieldSuffix param on the
 point fieldType.  double underscore used
 // to minimize collisions with user-defined dynamic fields.
 home_0__latlon
 home_1__latlon
 
 // And the indexed fields for dynamic field work_point would be
 work_point__0__latlon
 work_point__1__latlon
 
 // NOTE: this scheme works fine for subFields with different fieldTypes



Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-09 Thread Mattmann, Chris A (388J)
Hi Yonik et al.,

I¹d like to add:

Option C: Sub fields are specified as a attribute on the fieldType tag

// needed to essentially define the point type
 fieldType name=latlon class=GeoPoint subFieldSuffix=_latlon ../

// uses of the latlon type
 field name=home type=latlon indexed=true stored=false/

// subFieldSuffix is appended to the subFields indexed and thus those would
be:

home_latlon_0
home_latlon_1
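
As a rough illustration of the Option C naming above, here is a minimal Python sketch (not Solr code). The function name and the `dimension` parameter are assumptions made for the example; only the `<field name><subFieldSuffix>_<index>` shape comes from the names listed above.

```python
from typing import List


def option_c_subfields(field_name: str, sub_field_suffix: str,
                       dimension: int = 2) -> List[str]:
    """Build hypothetical sub-field names: <field><suffix>_<index>."""
    return [f"{field_name}{sub_field_suffix}_{i}" for i in range(dimension)]


# The "home" field of type latlon with subFieldSuffix=_latlon
print(option_c_subfields("home", "_latlon"))
```

For `home` and `_latlon` this yields `home_latlon_0` and `home_latlon_1`, matching the two names given above.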

I like elements of Option B that you present below, however it seems to be
mixing concerns. Type inheritance (aka your subFieldType attribute) seems
to be orthogonal to poly fields -- a good idea, but another issue IMHO.

Cheers,
Chris

On 12/9/09 1:12 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 Proposal for handling points using only the field lookup mechanisms
 currently in place in IndexSchema:
 
 Option A: dynamic fields used for subfields, those dynamic fields need
 to be explicitly defined in the XML
 
 // needed to essentially define the point type
 fieldType name=latlon class=TrieDoubleField precisionStep=8/
 fieldType name=point subFieldSuffix=_latlon .../
 dynamicField name=*_latlon type=latlon indexed=true stored=false/
 
 // uses of the point type
 field name=home type=point/
 dynamicField name=*_point type=point/
 
 // subFieldSuffix is appended to the subFields indexed and thus those would be
 home__0_latlon
 home__1_latlon
 
 // And the indexed fields for dynamic field work_point would be
 work_point__0_latlon
 work_point__1_latlon
 
 // NOTE: this scheme works fine for subFields with different fieldTypes
 
 Option B: dynamic fields used for subfields, dynamic fields inserted
 into schema automatically
 
 // needed to essentially define the point type
 fieldType name=latlon class=TrieDoubleField precisionStep=8/
 fieldType name=point subFieldType=latlon/
 
 // uses of the point type
 field name=home type=point/
 dynamicField name=*_point type=point/
 
 // A dynamic field is inserted into the schema by the point class of
 the form __subFieldTypeName by default.
 // This could be changed via an optional subFieldSuffix param on the
 point fieldType.  double underscore used
 // to minimize collisions with user-defined dynamic fields.
 home_0__latlon
 home_1__latlon
 
 // And the indexed fields for dynamic field work_point would be
 work_point__0__latlon
 work_point__1__latlon
 
 // NOTE: this scheme works fine for subFields with different fieldTypes
 
 
 -Yonik
 http://www.lucidimagination.com
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department University of
Southern California, Los Angeles, CA 90089 USA
++




RE: SOLR-1131 - Multiple Fields per Field Type

2009-12-07 Thread Chris Hostetter

: fieldType name=latlon type=LatLonFieldType pattern=location__* /
: fieldType name=latlon_home type=LatLonFieldType 
pattern=location_home_*/
: fieldType name=latlon_work type=LatLonFieldType 
pattern=location_home_*/
: 
: field name=location type=latlon/
: field name=location_home type=latlon_home/
: field name=location_work type=latlon_work/

I'm not really understanding the value of an approach like that.  for 
starters, what Lucene field names would ultimately be created in those 
examples?  And if i also added...

 <field name="other_location" type="latlon"/>
 <dynamicField name="*_dynamic_location" type="latlon"/>

...then what field names would be created under the covers?

: I think it makes more sense to define the heterogeneity at the fieldType 
level because:
: 
: (a) it's a bit more consistent with the existing solr schema examples, 
: where the difference between many of the field types (e.g., ints and 
: tints, which are both solr.TrieIntField's, date and tdate, both 
: instances of solr.TrieDateField, with different configuration, etc.)
: 
: (b) isolation of change: fieldType defs will change less often than 
: field defs, where names and indexed/stored/etc. debugging are likely 
: to occur more frequently

...this just feels wrong to me ... i can't really explain why.  It seems
like you are suggesting that every <field/> declaration would need a
one-to-one correspondence with a unique <fieldType/> declaration in order to
prevent field name collisions, which sounds sketchy enough ... but i'm
also not fond of the idea that a person editing the schema can't just look
at the <field/> and <dynamicField/> names to ensure that they understand
what underlying fields are being created (so they don't inadvertently add
a new one that collides) ... now they also have to look at the pattern
attribute of every <fieldType/> that is a poly field.

letting <dynamicField/> drive everything just seems a *lot* simpler ...
both as far as implementation, and as far as maintaining the schema.

: I don't think the above hybrid approach will lead to anything other than 
: confusion, as you indicated above. Let's stick to the pattern defs at 
: the fieldType level, and then let the fieldType handle the internal 
: dynamicity with e.g., a dynamicField, and then notify the schema user 

From the standpoint of reading a schema.xml file, the approach you're
describing of a pattern attribute on <fieldType/> declarations actually
seems more confusing than the strawman suggestion i made of a pattern
attribute on <field/> ... even without understanding what concrete fields
you are suggesting would be created with a configuration like that, it
still increases the number of places you have to look to see what field
names are getting created.


-Hoss



Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-07 Thread Chris Hostetter
: I'm not sure if you worry about it.  But I'd argue it isn't natural 
: anyway.  You would do the following instead, which is how any address 
: book I've ever seen works:
: field name=home type=LatLonFT/
: field name=work type=LatLonFT/

...the home vs work distinction was arbitrary.  the point is what if
i want to support an arbitrary number of distinct values in a PolyField?
... with your approach any attempt to search for people near X would
require me to search for work near X or home near X ... which is analogous
to one of the main purposes of multivalued fields: so i don't have to
uniquely name every Field instance.  I might have a thousand unique
(but unnamed) locations that i want to associate with a document, and i
want to search for documents with a location near X ... likewise i might
have thousands of unique polygons associated with a document and i want to
search for documents where one or more polygons overlap with an input
polygon (ie: island nations overlapping with the flight path of an
airplane).

The question is: how can/would PolyFields deal with input like this? ..
we've discussed cardinality in the number of fields produced by a single
input value, but we haven't really discussed cardinality in the number of
input values.

: So, maybe the FT can explicitly prohibit multivalued?  But, I suppose 
: you could do the position thing, too.  This could be achieved through a 
: new SpanQuery pretty easily:  SpanPositionQuery that takes in a term and 
: a specific position.  Trivial to write, I think, just not sure if it is 
: generally useful.  Although, I must say I've been noodling around with 

The problem is how do you let the PolyField specify the position when
indexing?  the last API i saw fleshed out in this discussion didn't give
the PolyField any information about how many input values were in any
given doc, it just allowed PolyFields to be String=>Field[] black boxes
(as opposed to the String=>Field black boxes FieldTypes must currently
be).

We can't assume even basic lastPosition+1 type logic for these
polyfields, because different input values might produce Field arrays
containing different quantities of fields, with different names.  if a
CartesianPolyFieldType can get away with only using the grid_level1 and
grid_level2 fields for one input value, but other input values require
using grid_level1, grid_level2, and grid_level3, then simple position
increments aren't enough if a document has multiple values (some of which
need 2 different Field names, and others that need 3)
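
The misalignment described above can be sketched in a few lines of Python. This is a toy model, not Lucene or Solr code: `index_values` and the token strings are made up for illustration, and "position" here simply means "how many tokens this field name has already received in the document".

```python
from collections import defaultdict


def index_values(values):
    """Index one document's values with naive per-field position counters.

    values: one dict per input value, mapping field name -> token.
    Returns {field_name: [(position, value_index, token), ...]}.
    """
    postings = defaultdict(list)
    for value_index, fields in enumerate(values):
        for name, token in fields.items():
            position = len(postings[name])  # naive lastPosition + 1
            postings[name].append((position, value_index, token))
    return dict(postings)


doc = [
    {"grid_level1": "a1", "grid_level2": "a2"},                       # needs 2 fields
    {"grid_level1": "b1", "grid_level2": "b2", "grid_level3": "b3"},  # needs 3 fields
]
postings = index_values(doc)
print(postings["grid_level1"])  # position 0 belongs to the first value
print(postings["grid_level3"])  # position 0 belongs to the second value
```

Position 0 in grid_level1 and position 0 in grid_level3 come from different input values, so matching on equal positions across the generated fields would pair up tokens that do not belong together.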


-Hoss



Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-07 Thread Grant Ingersoll

On Dec 7, 2009, at 5:59 PM, Chris Hostetter wrote:

 
 : fieldType name=latlon type=LatLonFieldType pattern=location__* /
 : fieldType name=latlon_home type=LatLonFieldType 
 pattern=location_home_*/
 : fieldType name=latlon_work type=LatLonFieldType 
 pattern=location_home_*/
 : 
 : field name=location type=latlon/
 : field name=location_home type=latlon_home/
 : field name=location_work type=latlon_work/
 
 I'm not really understanding the value of an approach like that.  for 
 starters, what Lucene field names would ultimately be created in those 
 examples?  And if i also added...


Have a look at the patch I put up today.  I think it is going to work quite 
well, but that could be jet-lag induced delirium at this point.

For a field type:
<fieldType name="point" type="solr.PointType" dimension="2" subFieldType="double"/>

and a field declared as:
<field name="home" type="point" indexed="true" stored="true"/>

And a new document of:
<doc>
<field name="home">39.0 -79.434</field>
</doc>

There are three fields created:
home -- Contains the stored value
home___0 -- Contains 39.0 indexed as a double (as in the double FieldType, not
just double precision)
home___1 -- Contains -79.434 indexed as a double
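
A minimal Python sketch of that expansion, for readers who want to see the three fields side by side. This is not the patch itself: `create_point_fields` is a hypothetical helper, and splitting the value on whitespace is an assumption based on the example document above.

```python
from typing import Dict, Union


def create_point_fields(field_name: str, value: str,
                        dimension: int = 2) -> Dict[str, Union[str, float]]:
    """Expand one point value into a stored field plus one sub-field per dimension."""
    parts = value.split()
    if len(parts) != dimension:
        raise ValueError(f"expected {dimension} coordinates, got {len(parts)}")
    fields: Dict[str, Union[str, float]] = {field_name: value}  # stored value
    for i, part in enumerate(parts):
        # sub-field name: field name, three underscores, dimension index
        fields[f"{field_name}___{i}"] = float(part)
    return fields


fields = create_point_fields("home", "39.0 -79.434")
print(fields)
```

The result has the keys `home`, `home___0` and `home___1`, mirroring the three fields listed above.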



 
 field name=other_location type=latlon/
 dynamicField name=*_dynamic_location type=latlon/
 
 ...then what field names would be created under the covers?
 
 : I think it makes more sense to define the heterogeneity at the fieldType 
 level because:
 : 
 : (a) it's a bit more consistent with the existing solr schema examples, 
 : where the difference between many of the field types (e.g., ints and 
 : tints, which are both solr.TrieIntField's, date and tdate, both 
 : instances of solr.TrieDateField, with different configuration, etc.)
 : 
 : (b) isolation of change: fieldType defs will change less often than 
 : field defs, where names and indexed/stored/etc. debugging are likely 
 : to occur more frequently
 
 ...this just feels wrong to me ... i can't really explain why.  It seems 
 like you are suggesting thatt every field/ declaration would need a one 
 to one corrispondence with a unique fieldType/ declaration in order to 
 prevent field name collisions, which sounds sketchy enough ... but i'm 
 also not fond of the idea that a person editing the schema can't just look 
 at the field/ and dynamicField/ names to ensure that they understand 
 what underlying fields are being created (so they don't inadvertantly add 
 a new one that collides) ... now they also have to look at the pattern 
 attribute of every fieldType/ that is a poly field.
 
 letting dynamicField/ drive everything just seems a *lot* simpler ... 
 both as far as implementation, and as far as maintaining the schema.

I don't agree.  It requires more configuration and more knowledge by the end
user and doesn't hide the details.



Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-07 Thread Grant Ingersoll

On Dec 7, 2009, at 6:13 PM, Chris Hostetter wrote:

 : I'm not sure if you worry about it.  But I'd argue it isn't natural 
 : anyway.  You would do the following instead, which is how any address 
 : book I've ever seen works:
 : field name=home type=LatLonFT/
 : field name=work type=LatLonFT/
 
 ...the home vs work distinction was arbitrary.  the point is what if 
 i want to support an arbitrary number of distinct values in a PolyField? 

This is the beauty of Yonik's addition of getFieldQuery() to the FieldType.  
The FieldType will be aware of the arbitrariness.  Furthermore, it can 
reflect on the index itself via IndexReader.getFieldNames() to determine the 
number of Fields that actually exist if it has to.  However, my guess is that 
in practice in most situations the FieldType author/user will have the info it 
needs.  Still, I think we can also evolve if we need to.

 ... with your approach any attempt to search for people near X would 
 require me to search for work near X or home near X ... which is analogous 
 to oneof hte main purposes of multivalued fields: so i don't have to 
 uniquely name every Field instance.  

Sure, but would you really ever model multiple locations like that in the same 
field?  I don't think in practice that you would, so I think it is a bit of a 
red herring.  Perhaps there is a different use case that better demonstrates it?

 I might have a thousand unique 
 (but unamed) locations that i want to associate with a document, and i 
 want to search for documents with a location near X ... likewise i might 
 have thousands unique polygons associated with a document and i want to 
 search for documents where one or more polygons overlap with an input 
 polygon (ie: island nations overlapping with the flight path of an 
 airplane).

I don't think this implementation precludes that.  The FunctionQueries only 
operating on a single valued field does, however.  Setting that aside, we could 
write a Query that does what you want, I think.

 
 The question is: how can/would PolyFields deal with input like this? .. 
 we've discussed cardniality in the number of fields produced by a single 
 input value, but we haven't really discussed cardinality in the number of 
 input values.

I'm not sure that it does, but I don't know that it needs to just yet.  This 
might be where an R-Tree implementation comes in handy, but I'll leave it to 
the geo-experts to discuss more.  

I also am not sure how the PolyField case is any different than the dynamic 
field case.  Either way, something needs to know the names of the fields that 
were created.


 
 : So, maybe the FT can explicitly prohibit multivalued?  But, I suppose 
 : you could do the position thing, too.  This could be achieved through a 
 : new SpanQuery pretty easily:  SpanPositionQuery that takes in a term and 
 : a specific position.  Trivial to write, I think, just not sure if it is 
 : generally useful.  Although, I must say I've been noodling around with 
 
 The problem is how do you let the PolyField specify the position when 
 indexing?  the last API i saw fleshed out in this discussion didn't give 
 the PolyField any information about how many input values were in any 
 given doc, it just allowed PolyFields to be String=Field[] black boxes 
 (as opposed to the String=Field[] black box FieldTYpes must currently 
 be).
 
 We can't assume even basic lastPostion+1 type logic for these 
 polyfields, because differnet input values might produce Filed arrays 
 containing different quantities of fields, with differnet names.  if a 
 CartiesienPolyFieldType can get away with only using the grid_level1 and 
 grid_level2 fields for one input value, but other input values require 
 using grid_level2, grid_level2, and grid_level3, then simple position 
 increments aren't enough if a document has multiple values (some of which 
 need 2 different Field names, and others that need 3)


That's not how the Cartesian Field stuff works, but I think I see what you are 
getting at and I would say I'm going to explicitly punt on that right now.  
Ultimately, I think when such a case comes up, the FieldType needs to be 
configured to be able to determine this information.

-Grant

Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-07 Thread Mattmann, Chris A (388J)
Hi Hoss,

 
 : fieldType name=latlon type=LatLonFieldType pattern=location__* /
 : fieldType name=latlon_home type=LatLonFieldType
 pattern=location_home_*/
 : fieldType name=latlon_work type=LatLonFieldType
 pattern=location_home_*/
 :
 : field name=location type=latlon/
 : field name=location_home type=latlon_home/
 : field name=location_work type=latlon_work/
 
 I'm not really understanding the value of an approach like that.  for
 starters, what Lucene field names would ultimately be created in those
 examples?  

The first field would be named location__location.
The second field would be named location_home_location_home.
The third field would be named location_work_location_work.

 And if i also added...
 
  field name=other_location type=latlon/
  dynamicField name=*_dynamic_location type=latlon/
 
 ...then what field names would be created under the covers?
 

In general, it would be FieldType#getPattern().stripOffEndRegexStarStuff() +
Field#getName(). 
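
A small Python sketch of that rule, under the assumption that stripOffEndRegexStarStuff() (pseudocode, not an existing method) simply drops the trailing `*` from the pattern. The third pattern is taken to be location_work_*, which is what the third field name above implies.

```python
def poly_field_name(pattern: str, field_name: str) -> str:
    """Pattern with its trailing '*' removed, followed by the field name."""
    return pattern.rstrip("*") + field_name


print(poly_field_name("location__*", "location"))
print(poly_field_name("location_home_*", "location_home"))
print(poly_field_name("location_work_*", "location_work"))
# Applying the same rule to the extra field from the question:
print(poly_field_name("location__*", "other_location"))
```

The first three calls reproduce the three names listed above; the last shows what the same rule would give for other_location when it uses the latlon type.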

 : I think it makes more sense to define the heterogeneity at the fieldType
 level because:
 :
 : (a) it's a bit more consistent with the existing solr schema examples,
 : where the difference between many of the field types (e.g., ints and
 : tints, which are both solr.TrieIntField's, date and tdate, both
 : instances of solr.TrieDateField, with different configuration, etc.)
 :
 : (b) isolation of change: fieldType defs will change less often than
 : field defs, where names and indexed/stored/etc. debugging are likely
 : to occur more frequently
 
 ...this just feels wrong to me ... i can't really explain why.  It seems
 like you are suggesting thatt every field/ declaration would need a one
 to one corrispondence with a unique fieldType/ declaration in order to
 prevent field name collisions, which sounds sketchy enough ... but i'm
 also not fond of the idea that a person editing the schema can't just look
 at the field/ and dynamicField/ names to ensure that they understand
 what underlying fields are being created (so they don't inadvertantly add
 a new one that collides) ... now they also have to look at the pattern
 attribute of every fieldType/ that is a poly field.

Well if this feels wrong to you then I think the schema.xml file that ships
with SOLR should feel wrong as well, because it uses the exact same
pattern for defining field type variations. That is, differences between
FieldType representations for ints and tints are not stored as variations on
the SchemaField definition itself but as variations on the
FieldTypes (e.g., a different precisionStep in the case of int [0] versus
that of tint [8]). Based on what you are proposing, why isn't precisionStep
an attribute on field, rather than fieldType in those examples?

 
 letting dynamicField/ drive everything just seems a *lot* simpler ...
 both as far as implementation, and as far as maintaining the schema.

Possibly. It's also a lot less traceable. It's implicit versus explicit,
which I'm not sure leads to simplicity in the end.

 
 : I don't think the above hybrid approach will lead to anything other than
 : confusion, as you indicated above. Let's stick to the pattern defs at
 : the fieldType level, and then let the fieldType handle the internal
 : dynamicity with e.g., a dynamicField, and then notify the schema user
 
 From the standpoint of reading a schema.xml file, the approach you're
 describing of a pattern attribute on fieldType/ declarations actaully
 seems more confusing then the strawman suggestion i made of a pattern
 attribute on field ... even without understanding what concrete feilds
 you are suggesting would be created with a configuration like that, it
 still increases the number of places you have to look to see what field
 names are getting created.

How so? In actuality, it reduces it. Instead of having pattern definitions
on fields (which there is a greater chance of having more of), you have them
on field types?

Cheers,
Chris





RE: SOLR-1131 - Multiple Fields per Field Type

2009-12-04 Thread Steven A Rowe
Hi Grant,

On 12/02/2009 at 2:30 PM, Grant Ingersoll wrote:
 I've been noodling around with the idea with the notion of a
 layered field where variants of a primary token are stored at
 sub positions of the primary token (instead of in separate copy
 fields)

The Indri search engine (now part of Lemur) uses a similar idea: fields are 
implemented as potentially overlapping extents over the (single) stream of 
document tokens.  (Howard Turtle, who is now the CNLP director, and has been 
involved in Indri development, told me about this feature.  He says it allows 
for natural representation of fields projected onto hierarchical data, e.g. 
XML.)  I wasn't able to find much documentation about this online when I looked 
just now, but here's a high-level overview of the Indri repository (aka 
index) structure:

http://www.lemurproject.org/docs/index.php/Indri_Repository_Structure

(See the Field Information Files section near the bottom.)

Steve



Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-04 Thread Erik Hatcher
And this is also an approach Yonik drafted here for user/tagging  
design: http://wiki.apache.org/solr/UserTagDesign


Erik


On Dec 4, 2009, at 1:35 PM, Steven A Rowe wrote:


Hi Grant,

On 12/02/2009 at 2:30 PM, Grant Ingersoll wrote:

I've been noodling around with the idea with the notion of a
layered field where variants of a primary token are stored at
sub positions of the primary token (instead of in separate copy
fields)


The Indri search engine (now part of Lemur) uses a similar idea:  
fields are implemented as potentially overlapping extents over the  
(single) stream of document tokens.  (Howard Turtle, who is now the  
CNLP director, and has been involved in Indri development, told me  
about this feature.  He says it allows for natural representation of  
fields projected onto hierarchical data, e.g. XML.)  I wasn't able  
to find much documentation about this online when I looked just now,  
but here's a high-level overview of the Indri repository (aka  
index) structure:


http://www.lemurproject.org/docs/index.php/Indri_Repository_Structure

(See the Field Information Files section near the bottom.)

Steve





Re: SOLR-1131 - Multiple Fields per Field Type

2009-12-02 Thread Grant Ingersoll

On Dec 1, 2009, at 1:42 AM, Chris Hostetter wrote:

 
 It feels like something we've overlooked in this discussion is whether we 
 need to worry about any FieldType API changes needed to make these new 
 PolyField classes aware of when they are multivalued.
 
 The API suggestions grant made gives the FieldTYpe the ability to return a 
 Filed[] from a single field value input -- but it doesn't provide any 
 information about wether that field value is one of many values we're 
 indexing for this field name.
 
 Imagine that i want to make an index of people i know.  Each person also 
 has multiple locations where they can frequently be found (home, work, 
 gym, girlfriends house, favorite coffee shop, etc..).  My common case is 
 to search for people, not locations, so it doesn't make sense to flatten 
 out and have a doc for each person+location, i just want a single doc per 
 person, but htat means i need a locations field that's multivalued.
 
 If i'm using a simple LatLonFieldType that splits my comma seperated 
 coordinate string into a locations__LAT and a locations__LON field 
 then  iassume it needs to do something special in the multiValued case to 
 make sure later near searches don't get confused and think that the lat 
 from my work and the lon from my home are actaully a third location.
 
 how do we solve this?

I'm not sure you need to worry about it.  But I'd argue it isn't natural anyway.
You would do the following instead, which is how any address book I've ever
seen works:
<field name="home" type="LatLonFT"/>
<field name="work" type="LatLonFT"/>

So, maybe the FT can explicitly prohibit multivalued?  But, I suppose you
could do the position thing, too.  This could be achieved through a new
SpanQuery pretty easily: a SpanPositionQuery that takes in a term and a specific
position.  Trivial to write, I think, just not sure if it is generally useful.
Although, I must say I've been noodling around with the notion of a layered
field, where variants of a primary token are stored at sub positions of the
primary token (instead of in separate copy fields), and then one could write a
query that says, for instance, search all of the secondary terms.  So, for
instance, if you think of each position containing a stack of terms, then you
could say use the terms at position two in the stack.  I'm not quite sure what
this means just yet, but my thinking is that I could get a really compact
index at the cost of a slightly more complex query.  It also means I could do
some interesting things at query time that simply cannot be done across fields
at the moment, for instance, create a phrase type query that uses different
layers where appropriate.
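
One possible reading of the layered-field idea, as a toy Python model. Nothing here is an existing Lucene API: the stream is a list of positions, each position holds a stack of terms (layer 0 is the primary token, higher layers are variants), and both helper functions are hypothetical.

```python
def positions_at_layer(stream, term, layer):
    """Return positions whose stack holds `term` at the given layer."""
    return [pos for pos, stack in enumerate(stream)
            if layer < len(stack) and stack[layer] == term]


def phrase_match(stream, phrase):
    """phrase: list of (term, layer). True if the terms occur at consecutive
    positions, each looked up in its own layer."""
    n = len(phrase)
    for start in range(len(stream) - n + 1):
        if all(layer < len(stream[start + k])
               and stream[start + k][layer] == term
               for k, (term, layer) in enumerate(phrase)):
            return True
    return False


# Layer 0: surface form; layer 1: a stemmed variant (where one exists)
stream = [["Running", "run"], ["shoes", "shoe"], ["sale"]]
print(positions_at_layer(stream, "run", 1))
print(phrase_match(stream, [("run", 1), ("shoes", 0)]))
```

The second call is the "phrase type query that uses different layers" case: the stem layer for the first term and the primary layer for the second, within a single field.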

-Grant

Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-30 Thread Chris Hostetter

: Maybe, but something needs that logic. Think relational database -- if you
: try and add a field to a schema (e.g., using some DBMS client GUI or vanilla
: command line SQL) where that name already exists, then you get a SQL
: exception. Similarly, SOLR should support such concepts. Maybe it doesn't go
...
: How are you screwed? If you add a field name that collides then as long as
: the FieldType checked you'd still be OK? Maybe FieldTypes that support
: multi-internal fields should have a requirement that they be configured from
: the schema.xml file themselves, so that the user configuring the entire
: schema can be made to deal with the namespacing at that level -- then if she

...but these are just two different solutions that illustrate my overall
point: one way or another, the person editing the schema.xml file needs to
know that these FieldTypes are going to be adding additional fields to the
index and has to be aware of the possibility of field name collisions --
either because they are required to configure what those additional
fieldnames are for these FieldTypes to prevent collisions, or because they
have to know how to make sense of errors that might get logged if the
system detects a collision.

So rather than try to make it entirely magical and behind the scenes, and
still require them to know about it if a collision happens and they get an
error, let's put it right out in front of them so they know about it and
think it through.

if people feel that something like this...

  <fieldType name="latlon" type="LatLonFieldType" />
  <dynamicField name="location*" type="latlon" />

...where an end user can deal with these fields...

location
location_home
location_work

...and under the covers the field type uses...

location__LAT + location__LON
location_home__LAT + location_home__LON
location_work__LAT + location_work__LON

...is an abuse of the <dynamicField/> syntax, then we could accomplish the
same thing with something like...

  <fieldType name="latlon" type="LatLonFieldType" />
  <field name="location" type="latlon" pattern="location__*" />
  <field name="location_home" type="latlon" pattern="location_home__*" />
  <field name="location_work" type="latlon" pattern="location_work__*" />

...but that would be more verbose, and would be somewhat confusing to try
and use as a true dynamicField (ie: we want to support home, work and
anything else picked at run time)...

  <fieldType name="latlon" type="LatLonFieldType" />
  <field name="location" type="latlon" pattern="location*" />
  <dynamicField name="location_*" type="latlon" pattern="??whatgoeshere??" />


so why not just leverage the existing <dynamicField/> syntax/mechanism where
schema creators already expect fields to be created at runtime, and
already have to think about possible name collisions?
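
To make the dynamicField variant concrete, a rough Python sketch follows. It is only a model of the naming: glob matching is done with Python's `fnmatch`, the `__LAT`/`__LON` suffixes are the ones from the example above, and `sub_fields` and `collisions` are hypothetical helpers, not Solr code.

```python
from fnmatch import fnmatchcase

SUFFIXES = ("__LAT", "__LON")


def sub_fields(field_name, dynamic_pattern="location*"):
    """Names the latlon type would use under the covers for one field."""
    if not fnmatchcase(field_name, dynamic_pattern):
        raise ValueError(f"{field_name!r} does not match {dynamic_pattern!r}")
    return [field_name + suffix for suffix in SUFFIXES]


def collisions(poly_field_names, other_field_names):
    """Explicitly declared fields that clash with generated sub-field names."""
    generated = {name for f in poly_field_names for name in sub_fields(f)}
    return sorted(generated & set(other_field_names))


print(sub_fields("location_home"))
print(collisions(["location", "location_work"], ["title", "location__LAT"]))
```

The second call shows the kind of clash a schema editor would have to watch for: a hand-declared location__LAT field colliding with the sub-field generated for location.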

-Hoss



Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-30 Thread Chris Hostetter

It feels like something we've overlooked in this discussion is whether we 
need to worry about any FieldType API changes needed to make these new 
PolyField classes aware of when they are multivalued.

The API suggestions grant made gives the FieldType the ability to return a
Field[] from a single field value input -- but it doesn't provide any
information about whether that field value is one of many values we're
indexing for this field name.

Imagine that i want to make an index of people i know.  Each person also
has multiple locations where they can frequently be found (home, work,
gym, girlfriend's house, favorite coffee shop, etc..).  My common case is
to search for people, not locations, so it doesn't make sense to flatten
out and have a doc for each person+location, i just want a single doc per
person, but that means i need a locations field that's multivalued.

If i'm using a simple LatLonFieldType that splits my comma separated
coordinate string into a locations__LAT and a locations__LON field
then i assume it needs to do something special in the multiValued case to
make sure later near searches don't get confused and think that the lat
from my work and the lon from my home are actually a third location.

how do we solve this?
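
The "third location" problem can be shown with a tiny Python model. The coordinates and function names are invented for the example; the two lists stand in for multivalued locations__LAT and locations__LON fields.

```python
def in_range(value, bounds):
    low, high = bounds
    return low <= value <= high


def naive_near(doc, lat_range, lon_range):
    """Checks lat and lon fields independently, as two unrelated multivalued fields."""
    return (any(in_range(lat, lat_range) for lat in doc["locations__LAT"])
            and any(in_range(lon, lon_range) for lon in doc["locations__LON"]))


def paired_near(points, lat_range, lon_range):
    """Checks each (lat, lon) pair as a unit."""
    return any(in_range(lat, lat_range) and in_range(lon, lon_range)
               for lat, lon in points)


points = [(34.0, -118.0), (40.0, -74.0)]  # home, work (made-up coordinates)
doc = {
    "locations__LAT": [lat for lat, _ in points],
    "locations__LON": [lon for _, lon in points],
}
# Query box around (40.0, -118.0): work's lat combined with home's lon
box = ((39.0, 41.0), (-119.0, -117.0))
print(naive_near(doc, *box), paired_near(points, *box))
```

The naive check matches even though neither real location lies in the box, because it combines the latitude of one value with the longitude of another.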

I suppose we could just rely on matching termPosition information, but that
means the FieldType needs a way to specify the Analyzer for all of the
field names it creates on the fly (another argument for reusing
dynamicFields i guess) to specify matching increments -- but that seems
somewhat brittle: what about complex PolyFieldTypes that want to create
a variable number of Fields based on the input?

ie: as i recall, if you want to index coordinates of polygon bounding
boxes using Cartesian grid fields, you need more field names for big
polygons than you do for small polygons -- so what if someone wants a
multivalued PolyField and indexes very big and very small polygons? ...
termPositions doesn't seem like it really cuts it here.



-Hoss



RE: SOLR-1131 - Multiple Fields per Field Type

2009-11-30 Thread Mattmann, Chris A (388J)
Hey Hoss,

 So rather then try to make it entirely magical and behind the scnes, and
 still require them to know about it if a collision happens and they get an
 error, let's put it right out in front of them so they know about it and
 think it through.

+1 to that -- was never trying to make anything magical, just to point out that 
there were a number of different solutions here, not all of which are 
orthogonal (as you pointed out above, SOLR may use a combination of intuitive 
log messages + explicit collision handling in code, not just one or the other).

 if people feel that something like this...

   fieldType name=latlon type=LatLonFieldType /
   dynamicField name=location* type=latlon /

 ...where an end user can deal with these fields...
 
location
location_home
location_work

 ...and under the covers the field type uses...

location__LAT + location__LON
location_home__LAT + location_home__LON
location_work__LAT + location_work__LON

 ...is an abuse of the dynamicField/ syntax, then we could accomplish the
 same thing with something like...
 
  fieldType name=latlon type=LatLonFieldType /
  field name=location type=latlon pattern=location__* /
  field name=location_home type=latlon pattern=location_home__* /
  field name=location_work type=latlon pattern=location_work__* /

Now you're talking. I like this option, with the following updates:

<fieldType name="latlon" type="LatLonFieldType" pattern="location__*" />
<fieldType name="latlon_home" type="LatLonFieldType" pattern="location_home_*"/>
<fieldType name="latlon_work" type="LatLonFieldType" pattern="location_work_*"/>

<field name="location" type="latlon"/>
<field name="location_home" type="latlon_home"/>
<field name="location_work" type="latlon_work"/>

I think it makes more sense to define the heterogeneity at the fieldType level
because:

(a) it's a bit more consistent with the existing solr schema examples, where
many of the field types differ only in their configuration (e.g., ints and
tints, which are both solr.TrieIntField's, and date and tdate, which are both
instances of solr.TrieDateField)

(b) isolation of change: fieldType defs will change less often than field
defs, where names and indexed/stored/etc. debugging are likely to occur more
frequently

...but that would be more verbose, and would be somewhat confusing to try
and use as a true dynamicField (ie: we want to support home, work and
anything else picked at run time)...

  <fieldType name="latlon" type="LatLonFieldType" />
  <field name="location" type="latlon" pattern="location*" />
  <dynamicField name="location_*" type="latlon" pattern="??whatgoeshere??" />

I don't think the above hybrid approach will lead to anything other than 
confusion, as you indicated above. Let's stick to the pattern defs at the 
fieldType level, and then let the fieldType handle the internal dynamicity 
with e.g., a dynamicField, and then notify the schema user by providing: (1) a 
nice intuitive set of documentation with the poly field types that says: don't 
use these reserved field names in your schema if you are using this field type 
in any of your field instances (the concept is the same as in P/L's -- you can't 
declare variables named for or int, etc.); and (2) intuitive error msgs and 
exceptions if the schema user insists on ignoring the poly field documentation.
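
A minimal sketch of point (2), assuming the reserved names are expressed as simple glob patterns like the pattern attribute proposed above. The class and method names (PolyFieldCollisionCheck, findCollisions) are hypothetical and are not part of Solr's IndexSchema; the sketch only shows how a schema-load check might report declared fields that fall inside a poly field's reserved namespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical schema-load check; none of these names exist in Solr. */
class PolyFieldCollisionCheck {

    /** Turns a glob such as "location__*" into a regex, treating '*' as "anything". */
    static Pattern globToRegex(String glob) {
        // Quote the whole glob, then re-open the quoting around each '*'.
        return Pattern.compile(Pattern.quote(glob).replace("*", "\\E.*\\Q"));
    }

    /** Returns one message per declared field that matches a reserved pattern. */
    static List<String> findCollisions(List<String> declaredFields,
                                       List<String> reservedPatterns) {
        List<String> problems = new ArrayList<>();
        for (String glob : reservedPatterns) {
            Pattern p = globToRegex(glob);
            for (String name : declaredFields) {
                if (p.matcher(name).matches()) {
                    problems.add("field '" + name
                            + "' collides with reserved poly field pattern '" + glob + "'");
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> fields = List.of("location", "location_home", "location__LAT");
        List<String> patterns = List.of("location__*", "location_home__*");
        // Only location__LAT falls inside a reserved namespace here.
        findCollisions(fields, patterns).forEach(System.out::println);
    }
}
```

Whether such a check would live in the FieldType or in IndexSchema is exactly the open question in this thread; the sketch takes no position on that.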

so why not just leverage the existing <dynamicField/> syntax/mechanism where
schema creators already expect fields to be created at runtime, and
already have to think about possible name collisions?

I think we should leverage dynamicFields, but maybe not explicitly. Otherwise 
you have to maintain the poly field def as both a dynamicField and a fieldType, 
which IMHO is not as elegant as multiple field type defs (configured instances 
of the same field type) with the pattern param you suggested, coupled with 
field declarations that use those fieldType configured instances.

Cheers,
Chris



RE: SOLR-1131 - Multiple Fields per Field Type

2009-11-30 Thread Mattmann, Chris A (388J)
Hey Hoss,

From: Chris Hostetter [hossman_luc...@fucit.org]
Sent: Monday, November 30, 2009 5:42 PM
To: solr-dev@lucene.apache.org
Subject: Re: SOLR-1131 - Multiple Fields per Field Type

It feels like something we've overlooked in this discussion is whether we
need to worry about any FieldType API changes needed to make these new
PolyField classes aware of when they are multivalued.

The API suggestions Grant made give the FieldType the ability to return a
Field[] from a single field value input -- but they don't provide any
information about whether that field value is one of many values we're
indexing for this field name.

Imagine that i want to make an index of people i know.  Each person also
has multiple locations where they can frequently be found (home, work,
gym, girlfriend's house, favorite coffee shop, etc..).  My common case is
to search for people, not locations, so it doesn't make sense to flatten
out and have a doc for each person+location, i just want a single doc per
person, but that means i need a locations field that's multivalued.

If i'm using a simple LatLonFieldType that splits my comma separated
coordinate string into a locations__LAT and a locations__LON field
then i assume it needs to do something special in the multiValued case to
make sure later near searches don't get confused and think that the lat
from my work and the lon from my home are actually a third location.
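
To make the multivalued pitfall concrete, here is a small sketch that models the two internal fields as plain lists. Everything in it (class name, method names, coordinates) is made up for illustration; it is not how Solr or Lucene evaluates range queries, only a demonstration of why independent matching on lat and lon can invent a location.

```java
import java.util.List;

/** Illustrative only: models locations__LAT / locations__LON as parallel lists. */
class MultiValuedLatLonPitfall {

    /** Checks lat and lon independently, as two unrelated multivalued fields would be. */
    static boolean naiveBoxMatch(List<Double> lats, List<Double> lons,
                                 double minLat, double maxLat,
                                 double minLon, double maxLon) {
        boolean latHit = lats.stream().anyMatch(v -> v >= minLat && v <= maxLat);
        boolean lonHit = lons.stream().anyMatch(v -> v >= minLon && v <= maxLon);
        return latHit && lonHit;
    }

    /** Pair-aware check: the i-th lat only ever goes with the i-th lon. */
    static boolean pairedBoxMatch(List<Double> lats, List<Double> lons,
                                  double minLat, double maxLat,
                                  double minLon, double maxLon) {
        for (int i = 0; i < lats.size(); i++) {
            double lat = lats.get(i);
            double lon = lons.get(i);
            if (lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // One person with two locations: home (40, -75) and work (34, -118).
        List<Double> lats = List.of(40.0, 34.0);
        List<Double> lons = List.of(-75.0, -118.0);
        // Box around (34, -75): work's lat combined with home's lon.
        System.out.println(naiveBoxMatch(lats, lons, 33.0, 35.0, -76.0, -74.0));  // true
        System.out.println(pairedBoxMatch(lats, lons, 33.0, 35.0, -76.0, -74.0)); // false
    }
}
```

The naive check reports a hit for a place this person never visits, which is the phantom third location described above.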

how do we solve this?

I suppose we could just rely on matching termPosition information, but that
means the FieldType needs a way to specify the Analyzer for all of the
field names it creates on the fly (another argument for reusing
dynamicFields i guess)

* or, alternatively, fieldTypes with configured pattern params *

to specify matching increments -- but that seems
somewhat brittle: what about complex PolyFieldTypes that want to create
variable number of Field's based on the input?

* This would seem to argue for smart FieldTypes that understand how their 
information is persisted (not just pattern parameters), but perhaps something 
that's difficult to codify in XML versus an actual P/L. Increments might be the 
only variant, but there may be more *

ie: as i recall, if you want to index coordinates of polygon bounding
boxes using cartesian grid fields, you need more field names for big
polygons than you do for small polygons -- so what if someone wants a
multivalued PolyField and indexes very big and very small polygons? ...
termPositions doesn't seem like it really cuts it here.

* good food for thought -- I'll sleep on it tonight and see what I can think of 
to add to the discussion...*

Cheers,
Chris


Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-29 Thread Grant Ingersoll

On Nov 28, 2009, at 7:37 PM, Chris Hostetter wrote:

 
 : I don't think it's useful to somehow programmatically access the list
 : of fields that a fieldType could output.
 
 based on my understanding of the potential types of use cases we're 
 talking about, i think i agree with you.  It seems like the most crucial 
 aspect is that a FieldType has a way of producing multiple 
 o.a.l.document.Field instances (potentially with different field names) 
 from a single String input at index time.  this can be done with something 
 like the API that Grant mentioned earlier.
 
 For anything except trivial use cases, any code (like a query parser) 
 attempting to deal with these fields is going to need to be very special 
 purpose and have direct knowledge of the code in the FieldType.  if a 
 CartesianGeoSearchQParser is asked to parse 
 store_location:89.3,45.4~5miles it can throw a parse exception if 
 IndexSchema.getFieldType("store_location") isn't an instanceof 
 CartesianGeoSearchFieldType -- assuming it is: it can cast it and call 
 CartesianGeoSearchFieldType specific methods to find out everything it 
 needs to know about what multitudes of field names that specific instance 
 produced based on its configuration.  (side thought: we may want to add a 
 getFieldTypesByClass method to the IndexSchema so QParsers and 
 SearchComponents can get lists of fields matching special cases they want 
 to know about -- but that's a secondary concern)
 
 One thing that concerns me is potential field name collision -- where one 
 of these new multifield producing FieldTypes might want to create a name 
 that happens to collide with a field the user has already declared.  
 
 Using Double underscores kind of feels like a hack, what i keep wondering 
 is if we can't leverage dynamicFields here.  if we require that these 
 FieldTypes be declared using dynamicField declarations (they could error 
 on init otherwise) then the wildcard nature of the name tells the 
 FieldType where it's allowed to add things to the pattern to make unique 
 field names in the index -- and they can still be used as true dynamic 
 fields, as long as they always add to the field name given to them. 
 something like...
 
   <dynamicField name="location*" type="geo1" />

I thought about this too.  It is what Local Solr currently does (although it 
expects a certain prefix, too, I believe).  However, it seems a bit 
unnecessary, as now the user needs to use both the field type and the dynamic 
field in order to get it to work, whereas I don't think they should have to do 
that, as it isn't in line with the notion of a field type.  FieldTypes 
currently can be used for any fields, both regular and dynamic.

Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-29 Thread Chris Hostetter

:  One thing that concerns me is potential field name collision -- where one
:  of these new multifield producing FieldTypes might want to create a name
:  that happens to collide with a field the user has already declared.
: 
: Since FieldTypes are provided an instance of o.a.solr.schema.IndexSchema,
: couldn't these special MultiFieldTypes just call #getFieldOrNull(fieldName)
: to find out if an internal field they want to name is available in the
: schema namespace?

that would tell them if a field name is currently in use, but not what to 
do about it if it is already in use -- FieldType classes shouldn't need 
complicated heuristics to figure out something the user could configure.

Even if the FieldTypes were crazy smart -- it wouldn't provide any 
future-proofing.  a FieldType could inspect the schema and see that 
certain fieldnames aren't in use right now, but then i could change my 
schema and add a field name that *does* collide with something it 
previously picked, and now i'm screwed.


-Hoss



Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-29 Thread Chris Hostetter

: I thought about this too.  It is what Local Solr currently does 
: (although it expects a certain prefix, too, I believe).  However, it 
: seems a bit unnecessary, as now the user needs to use both the field 
: type and the dynamic field in order to get it to work, whereas I don't 
: think they should have to do that, as it isn't in line with the notion 
: of a field type.  FieldTypes currently can be used for any fields, both 
: regular and dynamic.

we already have FieldTypes that only make sense when used as either a 
<field/> or as a <dynamicField/> ... RandomField only makes sense when 
used as a dynamicField, ExternalValueField doesn't make sense if you try 
to use it as a dynamicField -- it's just the nature of specialized 
FieldTypes.

It's one thing to say that we don't want search/index users to have to 
know about the details of how these fields work -- i agree with that, they 
should just be able to index and query against a location field and 
have it work, without knowing that location actually builds up a bunch 
of cartesian grid fields using names like location_0DAB9 ... but i think 
it's perfectly acceptable to ask that the schema creator / solr 
administrator have some understanding of these special field types, and 
to tell them "you need to declare these as <dynamicField/> because they 
add other low level fields using that prefix/suffix that you don't need to 
worry about."

The admin type users are going to need to know about these automagically 
created fields one way or another -- if not to prevent collision, then to 
make sure they don't get confused when they look at Luke and the schema 
browser.


-Hoss



Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-29 Thread Mattmann, Chris A (388J)
Hi Hoss,

On 11/29/09 12:22 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

 that would tell them if a field name is currently in use, but not what to
 do about it if it is already in use -- FieldType classes shouldn't need
 complicated heuristics to figure out something the user could configure.
 

Maybe, but something needs that logic. Think relational database -- if you
try and add a field to a schema (e.g., using some DBMS client GUI or vanilla
command line SQL) where that name already exists, then you get a SQL
exception. Similarly, SOLR should support such concepts. Maybe it doesn't go
in FieldType (though I'm not convinced of that since I don't think it's as
complicated as is implied), but at the very least it should go into
IndexSchema.


 Even if the FieldTypes were crazy smart -- it wouldn't provide any
 future-proofing.  a FieldType could inspect the schema and see that
 certain fieldnames aren't in use right now, but then i could change my
 schema and add a field name that *does* collide with something it
 previously picked, and now i'm screwed.

How are you screwed? If you add a field name that collides then as long as
the FieldType checked you'd still be OK? Maybe FieldTypes that support
multi-internal fields should have a requirement that they be configured from
the schema.xml file themselves, so that the user configuring the entire
schema can be made to deal with the namespacing at that level -- then if she
messes up the multi-field configuration, it's on her (with the help of SOLR
reporting a helpful exception and the area in the schema where the collision
occurred)...

Cheers,
Chris


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-28 Thread Erik Hatcher
What about, rather than conflating field types for creating multiple  
fields, using update processors to do this expansion instead?


Erik


On Nov 26, 2009, at 10:04 AM, Grant Ingersoll wrote:

 On Nov 25, 2009, at 8:24 PM, Chris Hostetter wrote:

  I'm having a hard time wrapping my head around this entire concept ... i
  know part of my problem is that your example use case seems somewhat
  nonsensical...

  : As a simple proof of concept, imagine that I define a new FieldType
  : called PlusMinusIntFieldType that extends IntField.  This FieldType
  : takes in an int value and outputs two Fields: one with the original
  : value and one with the negative of the value.
  ...

  : OK, on the search side is where it gets tricky.  The whole point of this
  : exercise is that the details are hidden from the user in the generic
  : case.  Thus, a query of plusMinus:5 should automatically expand to
  : (plusMinus__0:5 OR plusMinus__1:-5).  Of course, an expert user should

  ...nothing could match plusMinus__0:5 that didn't also match
  plusMinus__1:-5, so i don't really understand what the point of using the
  field expansion for a use case like this would be ... and that's making it
  hard for me to try and understand how this sort of system
  could/should/would be used at query time.

 Kind of, if a user just inputs plusMinus:5, then sure, but they may also
 want to just search the negative portion.  More importantly, though, they
 may have a QParser or some other component that can appropriately select
 one of the fields w/o the user knowing.

  perhaps a more realistic example would be helpful?

  ...or even some different simple and contrived examples that demonstrate
  how this could be useful in a way that isn't possible with a single
  field.

 OK, a more concrete example is spatial.  A user will want to index a
 point as a lat lon.   So, they index:  <field name="latLon">49, -79</field>.

 The implementation of how this gets indexed can be done in several
 ways.  For starters, it can be represented as a single field using
 Geohash or even just as a string (even if that isn't useful for
 much).  We don't need SOLR-1131 for that at all.  Next, they may just
 want to represent it as two fields:  one for the lat and one for
 the lon.  Again, not super hard to do now, but it requires the user
 to set it up, whereas with a LatLonFieldType, this would be hidden
 from them.  Finally, consider the cartesian tier case.  In this
 case, a single lat lon point could be mapped to a whole slew of
 tiers, where each tier is like a zoom level on a map application
 (like Google Maps).  Here, we could have a CartesianTierFieldType,
 that takes in the lower and upper bounds of the tiers to represent,
 i.e. tier 4 through 17, and this would output 13 different
 fields.  Local Solr currently handles this through dynamic fields
 and user level knowledge of the magic fields used.

 For this case, there are several different search patterns:
 1. The user may know the tier they want to search at and thus input
 tier and a zoom level.
 2. User invokes a QParser to build a bounding box (see
 https://issues.apache.org/jira/browse/SOLR-1568) and the Parser is
 responsible for creating a filter that chooses the most appropriate
 tier to search against.  So, the user might just say:
 {!tier lat=X lon=Y dist=10} and it will pick the most appropriate
 tier, whereas putting in dist=50 would likely pick a different tier.

 Does that help?

 BTW, all of this is tracked via SOLR-773.


Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-28 Thread Grant Ingersoll

On Nov 28, 2009, at 3:45 AM, Erik Hatcher wrote:

 What about, rather than conflating field types for creating multiple fields, 
 using update processors to do this expansion instead?

How do you maintain the semantic information needed at search time?  Are you 
still having the field type (or schema or something accessible by search) be 
aware of the change?

Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-28 Thread Chris Male
Hi,

Aren't search semantics the responsibility of a Query Parser and Querys
themselves?  Just as the semantics of boolean queries are handled by the
standard Query parsers and BooleanQuery.

On Sat, Nov 28, 2009 at 3:17 PM, Grant Ingersoll gsing...@apache.org wrote:


 On Nov 28, 2009, at 3:45 AM, Erik Hatcher wrote:

  What about, rather than conflating field types for creating multiple
  fields, using update processors to do this expansion instead?

 How do you maintain the semantic information needed at search time?  Are
 you still having the field type (or schema or something accessible by
 search) be aware of the change?


Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-28 Thread Yonik Seeley
On Sat, Nov 28, 2009 at 9:41 AM, Chris Male gento...@gmail.com wrote:
 Aren't search semantics the responsibility of a Query Parser and Querys
 themselves?  Just as the semantics of boolean queries are handled by the
 standard Query parsers and BooleanQuery.

At a certain point, one needs polymorphic behavior to do the right
thing (unless you hard-code all field type info into the query
parser).  This is already done via fieldType to control how range
queries, function queries, sort fields, etc, are created.

We *could* encode all of the info for both lat/lon in a single field,
but it would be more work since Lucene fieldcaches, numeric range
queries, etc, don't support that.  Practically, it seems easiest to
allow a single fieldType to use more than one internal field.

-Yonik
http://www.lucidimagination.com


Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-28 Thread Chris Male
Hi,

There is some standardization of the syntax and semantics of range queries,
function queries and sorting that exists outside of the field types
themselves though.  For example, for range queries FieldType expects that there
are just 2 values that define the range, I think.  That's a requirement that is
enforced by the Query Parser.

By allowing each FieldType to have its own search semantics, you are going
to have to let them do their own parsing too.  For Grant's example of a
PlusMinus kind of field, it's possible to support it through term query like
syntax so no custom parsing has to occur, but for other types of fields that
have multiple fields, that might not be possible.

In these situations is a custom Query Parser going to be necessary?

On Sat, Nov 28, 2009 at 4:35 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Sat, Nov 28, 2009 at 9:41 AM, Chris Male gento...@gmail.com wrote:
  Aren't search semantics the responsibility of a Query Parser and Querys
  themselves?  Just as the semantics of boolean queries are handled by the
  standard Query parsers and BooleanQuery.

 At a certain point, one needs polymorphic behavior to do the right
 thing (unless you hard-code all field type info into the query
 parser).  This is already done via fieldType to control how range
 queries, function queries, sort fields, etc, are created.

 We *could* encode all of the info for both lat/lon in a single field,
 but it would be more work since Lucene fieldcaches, numeric range
 queries, etc, don't support that.  Practically, it seems easiest to
 allow a single fieldType to use more than one internal field.

 -Yonik
 http://www.lucidimagination.com



Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-28 Thread Yonik Seeley
On Sat, Nov 28, 2009 at 10:51 AM, Chris Male gento...@gmail.com wrote:
 By allowing each FieldType to have its own search semantics

We're far enough removed from an actual feature, I'm not sure if we're
disagreeing about anything concrete :-)

Going back to Grant's original question, I think it's just a matter of
documentation for specific fieldTypes.
A certain fieldType like geo1 could index the lat and lon as separate
numeric fields.  That would be a specific behavior to that fieldType
that users would know about if they wanted to construct elementary
queries on lat or lon only.  That would not be supported in all geo
types of course - it's an implementation detail that may or may not be
part of the interface... it depends on what makes sense for the
particular fieldType.

I don't think it's useful to somehow programmatically access the list
of fields that a fieldType could output.

-Yonik
http://www.lucidimagination.com


Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-28 Thread Chris Hostetter

: I don't think it's useful to somehow programmatically access the list
: of fields that a fieldType could output.

based on my understanding of the potential types of use cases we're 
talking about, i think i agree with you.  It seems like the most crucial 
aspect is that a FieldType has a way of producing multiple 
o.a.l.document.Field instances (potentially with different field names) 
from a single String input at index time.  this can be done with something 
like the API that Grant mentioned earlier.

For anything except trivial use cases, any code (like a query parser) 
attempting to deal with these fields is going to need to be very special 
purpose and have direct knowledge of the code in the FieldType.  if a 
CartesianGeoSearchQParser is asked to parse 
store_location:89.3,45.4~5miles it can throw a parse exception if 
IndexSchema.getFieldType("store_location") isn't an instanceof 
CartesianGeoSearchFieldType -- assuming it is: it can cast it and call 
CartesianGeoSearchFieldType specific methods to find out everything it 
needs to know about what multitudes of field names that specific instance 
produced based on its configuration.  (side thought: we may want to add a 
getFieldTypesByClass method to the IndexSchema so QParsers and 
SearchComponents can get lists of fields matching special cases they want 
to know about -- but that's a secondary concern)
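
The check-then-cast flow described above can be sketched as follows. All types here (FieldType, Schema, CartesianGeoSearchFieldType, internalFieldNames) are self-contained stand-ins invented for the example, not Solr's real classes, and IllegalArgumentException stands in for whatever parse exception a real QParser would throw.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-in types only; this is not Solr's IndexSchema or FieldType API. */
class GeoParserSketch {

    static class FieldType { }

    /** Hypothetical poly field type that knows which internal fields it produces. */
    static class CartesianGeoSearchFieldType extends FieldType {
        List<String> internalFieldNames(String fieldName) {
            return List.of(fieldName + "_lat", fieldName + "_lon");
        }
    }

    /** Minimal schema: maps a field name to its type. */
    static class Schema {
        private final Map<String, FieldType> typesByField = new HashMap<>();

        void addField(String name, FieldType type) {
            typesByField.put(name, type);
        }

        FieldType getFieldType(String fieldName) {
            return typesByField.get(fieldName);
        }
    }

    /** What a special-purpose parser might do before building its query. */
    static List<String> resolveInternalFields(Schema schema, String fieldName) {
        FieldType ft = schema.getFieldType(fieldName);
        if (!(ft instanceof CartesianGeoSearchFieldType)) {
            throw new IllegalArgumentException(
                    fieldName + " is not a CartesianGeoSearchFieldType field");
        }
        // Safe cast: ask the type itself which field names it produced.
        return ((CartesianGeoSearchFieldType) ft).internalFieldNames(fieldName);
    }

    public static void main(String[] args) {
        Schema schema = new Schema();
        schema.addField("store_location", new CartesianGeoSearchFieldType());
        schema.addField("title", new FieldType());
        System.out.println(resolveInternalFields(schema, "store_location"));
    }
}
```

The parser never needs a generic "list the sub-fields" API on every FieldType; it only needs the specific type it already depends on.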

One thing that concerns me is potential field name collision -- where one 
of these new multifield producing FieldTypes might want to create a name 
that happens to collide with a field the user has already declared.  

Using Double underscores kind of feels like a hack, what i keep wondering 
is if we can't leverage dynamicFields here.  if we require that these 
FieldTypes be declared using dynamicField declarations (they could error 
on init otherwise) then the wildcard nature of the name tells the 
FieldType where it's allowed to add things to the pattern to make unique 
field names in the index -- and they can still be used as true dynamic 
fields, as long as they always add to the field name given to them. 
something like...

   <dynamicField name="location*" type="geo1" />

can be used to index a single location field (internally constructing 
location_lat and location_lon), or it could be used to support a 
location_start and location_end field (using 
location_start_lat+location_start_lon and 
location_end_lat+location_end_lon)
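
A rough sketch of the naming rule suggested here, assuming a trailing-wildcard dynamicField name and the _lat/_lon suffixes from the example. The class and method names are hypothetical; the point is only that names built by appending to the user's field name stay inside the namespace the wildcard already reserves.

```java
import java.util.List;

/** Illustrative helper; not Solr's dynamicField matching code. */
class DynamicFieldNamingSketch {

    /** Matches globs with a single trailing '*', e.g. "location*". */
    static boolean matches(String glob, String name) {
        if (!glob.endsWith("*")) {
            throw new IllegalArgumentException(
                    "this sketch only handles trailing-wildcard globs: " + glob);
        }
        String prefix = glob.substring(0, glob.length() - 1);
        return name.startsWith(prefix);
    }

    /** Internal names are built only by appending to the user's field name. */
    static List<String> subFieldNames(String glob, String fieldName) {
        if (!matches(glob, fieldName)) {
            throw new IllegalArgumentException(
                    fieldName + " is not covered by dynamicField " + glob);
        }
        return List.of(fieldName + "_lat", fieldName + "_lon");
    }

    public static void main(String[] args) {
        System.out.println(subFieldNames("location*", "location"));
        System.out.println(subFieldNames("location*", "location_start"));
        System.out.println(subFieldNames("location*", "location_end"));
    }
}
```

Because every generated name still matches location*, a schema author who has set aside that prefix has implicitly set aside the internal fields too.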



-Hoss



Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-28 Thread Mattmann, Chris A (388J)
Hey Hoss, 

On 11/28/09 4:37 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

 One thing that concerns me is potential field name collision -- where one
 of these new multifield producing FieldTypes might want to create a name
 that happens to collide with a field the user has already declared.

Since FieldTypes are provided an instance of o.a.solr.schema.IndexSchema,
couldn't these special MultiFieldTypes just call #getFieldOrNull(fieldName)
to find out if an internal field they want to name is available in the
schema namespace?

Cheers,
Chris







Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-28 Thread Yonik Seeley
On Sat, Nov 28, 2009 at 7:37 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:
 Using Double underscores kind of feels like a hack, what i keep wondering
 is if we can't leverage dynamicFields here.

This is what the prototype patch I just put up on SOLR-1131 does.
I've gone with option A for now (see cut'n'pasted comments from
DualPointType) but it's up to the specific FieldType.

/**
 * Two possible ways to allow the user to specify the __lat and __lon fields:
 *
 * A: let them specify the complete suffix for the lat field and lon field.
 * It is up to them to make sure those dynamic field types are defined in
 * the schema.
 * Advantages: no new mechanism needed to add fieldTypes during the
 * initialization of another fieldType.
 * Disadvantages: more clutter in the schema, lack of control over what
 * fieldTypes are used, need to delegate absolutely everything through the
 * subFieldType since we don't actually know what it is.
 *
 * B: Have the TriePointType create and insert the fieldtypes for the lat
 * and lon fields itself.
 * Advantages: less clutter in the schema, more control over the exact
 * numeric field type.
 * Disadvantages: dynamically adding new types not currently supported,
 *   less customizability
 *
 */

-Yonik
http://www.lucidimagination.com


Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-26 Thread Grant Ingersoll

On Nov 25, 2009, at 8:24 PM, Chris Hostetter wrote:

 
 I'm having a hard time wrapping my head around this entire concept ... i 
 know part of my problem is that your example use case seems somewhat 
 nonsensical...
 
 : As a simple proof of concept, imagine that I define a new FieldType 
 : called PlusMinusIntFieldType that extends IntField.  This FieldType 
 : takes in an int value and outputs two Fields: one with the original 
 : value and one with the negative of the value.
   ...
 
 : OK, on the search side is where it gets tricky.  The whole point of this 
 : exercise is that the details are hidden from the user in the generic 
 : case.  Thus, a query of plusMinus:5 should automatically expand to 
 : (plusMinus__0:5 OR plusMinus__1:-5).  Of course, an expert user should 
 
 ...nothing could match plusMinus__0:5 that didn't also match 
 plusMinus__1:-5, so i don't really understand what the point of using the 
 field expansion for a use case like this would be ... and that's making it 
 hard for me to try and understand how this sort of system 
 could/should/would be used at query time.

Kind of, if a user just inputs plusMinus:5, then sure, but they may also want 
to just search the negative portion.  More importantly, though, they may have a 
QParser or some other component that can appropriately select one of the fields 
w/o the user knowing.
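
For readers following the PlusMinus example, here is a minimal sketch of both halves: the index-time fan-out and the default query expansion. It uses plain strings and a map rather than Lucene Field objects, and the class and method names are invented for illustration; only the __0/__1 naming comes from the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of the PlusMinusIntFieldType idea; no Lucene or Solr types involved. */
class PlusMinusSketch {

    /** Index side: one int input becomes two named values. */
    static Map<String, Integer> createFields(String fieldName, int value) {
        Map<String, Integer> fields = new LinkedHashMap<>();
        fields.put(fieldName + "__0", value);   // original value
        fields.put(fieldName + "__1", -value);  // negated value
        return fields;
    }

    /** Query side: the generic case hides the two internal fields from the user. */
    static String expandQuery(String fieldName, int value) {
        return "(" + fieldName + "__0:" + value
                + " OR " + fieldName + "__1:" + (-value) + ")";
    }

    public static void main(String[] args) {
        System.out.println(createFields("plusMinus", 5));
        System.out.println(expandQuery("plusMinus", 5));
    }
}
```

An expert user, or a QParser acting on the user's behalf, could still address plusMinus__1 directly, which is the case Grant points to.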

 
 perhaps a more realistic example would be helpful?
 
 ...or even some different simple and contrived examples that demonstrate 
 how this could be useful in a way that isn't possible with a single 
 field.

OK, a more concrete example is spatial.  A user will want to index a point as a 
lat lon.   So, they index:  <field name="latLon">49, -79</field>.

The implementation of how this gets indexed can be done in several ways.  For 
starters, it can be represented as a single field using Geohash or even just as 
a string (even if that isn't useful for much).  We don't need SOLR-1131 for that 
at all.  Next, they may just want to represent it as two fields:  one for the 
lat and one for the lon.  Again, not super hard to do now, but it requires the 
user to set it up, whereas with a LatLonFieldType, this would be hidden from 
them.  Finally, consider the cartesian tier case.  In this case, a single lat 
lon point could be mapped to a whole slew of tiers, where each tier is like a 
zoom level on a map application (like Google Maps).  Here, we could have a 
CartesianTierFieldType, that takes in the lower and upper bounds of the tiers 
to represent, i.e. tier 4 through 17, and this would output 13 different 
fields.  Local Solr currently handles this through dynamic fields and user 
level knowledge of the magic fields used.

For this case, there are several different search patterns: 
1. The user may know the tier they want to search at and thus input tier and a 
zoom level.  
2. User invokes a QParser to build a bounding box (see 
https://issues.apache.org/jira/browse/SOLR-1568) and the Parser is responsible 
for creating a filter that chooses the most appropriate tier to search against. 
 So, the user might just say:  {!tier lat=X lon=Y dist=10} and it will pick the 
most appropriate tier, whereas putting in dist=50 would likely pick a different 
tier.
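
Search pattern 2 can be sketched as below. The tier model is an assumption made purely for illustration (tier N splits the map into 2^N cells per axis, and the parser picks the finest tier whose cells are still at least as wide as the search distance); it is not taken from Local Solr or the SOLR-1568 patch, and the constant and names are hypothetical.

```java
/** Hypothetical tier chooser; the cell-size model is an assumption, not Local Solr's math. */
class TierPickerSketch {

    /** Rough equatorial circumference in miles, used only to size cells. */
    static final double WORLD_WIDTH_MILES = 24902.0;

    /**
     * Picks the finest tier in [minTier, maxTier] whose cell width is still
     * at least distMiles. Falls back to minTier if even that tier is too fine.
     */
    static int pickTier(double distMiles, int minTier, int maxTier) {
        int best = minTier;
        for (int tier = minTier; tier <= maxTier; tier++) {
            double cellWidth = WORLD_WIDTH_MILES / Math.pow(2, tier);
            if (cellWidth >= distMiles) {
                best = tier;   // this tier can still cover the distance in one cell
            } else {
                break;         // finer tiers only get smaller
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A larger search distance selects a coarser (lower-numbered) tier.
        System.out.println(pickTier(10, 4, 17));
        System.out.println(pickTier(50, 4, 17));
    }
}
```

Under this model dist=10 and dist=50 land on different tiers, matching the behaviour described for {!tier ...}; the user never names a tier field.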

Does that help?

BTW, all of this is tracked via SOLR-773.

Re: SOLR-1131 - Multiple Fields per Field Type

2009-11-25 Thread Chris Hostetter

I'm having a hard time wrapping my head around this entire concept ... i 
know part of my problem is that your example use case seems somewhat 
nonsensical...

: As a simple proof of concept, imagine that I define a new FieldType 
: called PlusMinusIntFieldType that extends IntField.  This FieldType 
: takes in an int value and outputs two Fields: one with the original 
: value and one with the negative of the value.
...

: OK, on the search side is where it gets tricky.  The whole point of this 
: exercise is that the details are hidden from the user in the generic 
: case.  Thus, a query of plusMinus:5 should automatically expand to 
: (plusMinus__0:5 OR plusMinus__1:-5).  Of course, an expert user should 

...nothing could match plusMinus__0:5 that didn't also match 
plusMinus__1:-5, so i don't really understand what the point of using the 
field expansion for a use case like this would be ... and that's making it 
hard for me to try and understand how this sort of system 
could/should/would be used at query time.

perhaps a more realistic example would be helpful?

...or even some different simple and contrived examples that demonstrate 
how this could be useful in a way that isn't possible with a single 
field.

?

-Hoss