Re: Namespaces in response (SOLR-1586)

2009-12-21 Thread Chris Hostetter

:  eh ... agree to disagree i guess. it seems just as valid to say that
:  UpdateCommand -- what type of data does it update? ... or that
:  RequestHandler is ambiguous because it can only handle Solr requests,
:  so it should be titled SolrRequestHandler.
: 
: True! I guess it's just aesthetics. I can go either way, but I dunno. (and
: yes, just to be a pest, What type of data does that UpdateCommand update?)

Isn't it obvious from the context? ... Solr Data  :)

(i think that's the first, and last, time i've used an emoticon on a 
lucene mailing list )

: You give a little, you get a little back. Maybe a compromise is to call it
: NamedListResponseWriter, b/c that's really what it writes, no? Naming can be

By that logic every ResponseWriter is a NamedListResponseWriter, and a 
StringResponseWriter and a MapResponseWriter ... at a certain point you 
have to just trust that people will read the docs, you can't encode every 
bit of knowledge about the code base into the names.


-Hoss



Re: Namespaces in response (SOLR-1586)

2009-12-15 Thread Chris Hostetter

:  a SolrQueryResponse, no one has ever accused any of those response writers
:  of not being flexible enough to generate a *different* type of response in
:  those formats.
: 
: You may be right, but actually quite a few issues have referenced even non
: XMLWriters of similar issues. See:

I honestly don't understand what you're getting at here, this list of
issues is all over the map and almost none of them relate to the 
extensibility of any request handlers...

: http://issues.apache.org/jira/browse/SOLR-1616
  ... this was from someone who didn't notice json.nl=arrarr and 
  felt like the default way of representing a NamedList in JSON was odd.  
  they didn't disagree with the JSON structure, they just don't like the 
  default.
: http://issues.apache.org/jira/browse/SOLR-358
  ...this was an improvement issue to track adding the ruby response 
  writer ... which didn't exist before this.
: http://issues.apache.org/jira/browse/SOLR-1555
  ...this is a bug in how the term component adds the terms to the
  response ... it's completely orthogonal to the response output
  structure.
: http://issues.apache.org/jira/browse/SOLR-431
  ...this is from one of my coworkers who had some really old, really 
  hideously hackish plugins from before Solr was open sourced that was 
  trying to find a way to work around a bug fixed in the xml escaping --
  i could maybe see this as a response writers need to be more flexible
  type issue, except they knew from the start they were abusing
  a bug.
: http://issues.apache.org/jira/browse/SOLR-912
  ...this is an issue Kay opened to revamp NamedList to be more typesafe 
  ... also has absolutely nothing to do with how flexible the output
  representation is.

: Maybe, maybe not. I'm not sure the effect is to make it crystal clear as
: much as it is to make it clearer. XMLWriter is totally ambiguous -- what
: type of XML does it generate? I would argue SOLR response XML, hence the
: SolrXmlResponseWriter.

eh ... agree to disagree i guess. it seems just as valid to say that 
UpdateCommand -- what type of data does it update? ... or that 
RequestHandler is ambiguous because it can only handle Solr requests,
so it should be titled SolrRequestHandler.

we have enough ambiguity and confusion with some of our config file 
options and names that non-java users see ... the ones that only plugin 
writers see i'm less concerned with ... better to beef up the javadocs
than deal with a bunch of deprecation headaches just to add Solr to the
front of a class name.


-Hoss



Re: Namespaces in response (SOLR-1586)

2009-12-15 Thread Mattmann, Chris A (388J)
Hi Hoss:

On 12/15/09 6:39 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

 
 
 :  a SolrQueryResponse, no one has ever accused any of those response writers
 :  of not being flexible enough to generate a *different* type of response in
 :  those formats.
 :
 : You may be right, but actually quite a few issues have referenced even non
 : XMLWriters of similar issues. See:
 
 I honestly don't understand what you're getting at here, this list of
 issues is all over the map and almost none of them relate to the
 extensibility of any request handlers...

They may be all over the map, but in general they address your statement
about non-XML response writers being flexible enough to generate a
different type of response (although admittedly, none are as clear as the
XMLWriter examples, I'll give you that). The examples I gave were just based
on a quick search of JIRA.

 : Maybe, maybe not. I'm not sure the effect is to make it crystal clear as
 : much as it is to make it clearer. XMLWriter is totally ambiguous -- what
 : type of XML does it generate? I would argue SOLR response XML, hence the
 : SolrXmlResponseWriter.
 
 eh ... agree to disagree i guess. it seems just as valid to say that
 UpdateCommand -- what type of data does it update? ... or that
 RequestHandler is ambiguous because it can only handle Solr requests,
 so it should be titled SolrRequestHandler.

True! I guess it's just aesthetics. I can go either way, but I dunno. (and
yes, just to be a pest, What type of data does that UpdateCommand update?)

 
 we have enough ambiguity and confusion with some of our config file
 options and names that non-java users see ... the ones that only plugin
 writers see i'm less concerned with ... better to beef up the javadocs
 than deal with a bunch of deprecation headaches just to add Solr to the
 front of a class name.

You give a little, you get a little back. Maybe a compromise is to call it
NamedListResponseWriter, b/c that's really what it writes, no? Naming can be
a pain -- I'll try and think of a good one when I'm preparing the patch for
SOLR-1649.

Thanks for the discussion. Helps to clarify things!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department University of
Southern California, Los Angeles, CA 90089 USA
++




Re: Namespaces in response (SOLR-1586)

2009-12-14 Thread Chris Hostetter

: I'm conflicted here. In simple semantics, sure it's just an array of
: float/double numbers. A, if a string must be used a comma is probably OK, so
: long as it maps to some existing known approach to represent points. I've
: asked several times if there are examples. I can point to one that uses
: spaces to separate the coordinates in the point (georss). What others use
: comma? 

I have no opinion about the details ... space separated string, comma
separated string, list of ints ... they are all the same to me.

As a layman, my limited knowledge of geo coordinates has a vague notion 
that comma is the separator used when discussing latitude and longitude,
but i have no real knowledge of anything GIS related.  (i think i remember
that KML uses comma, but KML also has some weird idea that longitude comes
first because that's what the guys writing graphics rendering engines
apparently like: x-axis first)

: Well, I actually would disagree. What's the point of #toInternal and
: #toExternal then, other than to convert from the external representation to
: an internal Lucene index representation, and then to do the opposite coming
: out of the index? 

that is what they are for -- but they deal purely in string 
representations of the data itself -- they don't (and shouldn't) know/care
whether the data is then being encapsulated in JSON, thrift, Avro, Solr XML,
RSS, KML, etc

The String limitation of toExternal is one of the reasons toObject was
added (and the reason the BinaryResponseWriter uses toObject()).

: class final which it once was). We should rename that to
: SolrXmlResponseWriter, but it's not really generic XML (as the name
: suggests), it's SOLR's custom (undocumented) XML schema, right? Also, since

Eh... i don't know that the name suggests that it can generate generic 
XML, it generates a (particular) one to one mapping from the 
SolrQueryResponse to XML .. just like the JSONResponseWriter generates a 
one to one mapping from the SolrQueryResponse to JSON, and ditto for the
ruby/php/python writers ... there are an infinite number of possible
XML/JSON/Ruby/PHP/Python/etc. structures that *could* be generated from 
a SolrQueryResponse, no one has ever accused any of those response writers 
of not being flexible enough to generate a *different* type of response in 
those formats.

And practically speaking: slapping Solr in front of a response writer
classname isn't going to make it crystal clear that it produces a solr
specific type of .  It's going to make people think it's the
Solr implementation of .  Solr is the prefix of enough classnames
that eyeballs are just going to gloss over it.

: suggests), it's SOLR's custom (undocumented) XML schema, right? Also, since
: it's undocumented, I'd be happy to throw it together for its XML format.

we actually went round and round on documenting it back in the early days
.. frequently it was deemed self documenting enough for end users so not
much effort was ever put into it.  there was a Jira issue to create an
XSD, but even once we had one, no one really had any idea what to *do* 
with it...

https://issues.apache.org/jira/browse/SOLR-17


: Would that also be welcomed? Then, we should develop an easy extension point
: mechanism for people who want to develop their own XML response writers and
: write their own clients (or leverage existing clients that understand that
: XML).

+1

I think the crux of this would be an XML based response writer similar to the
BinaryResponseWriter that can use a codec type system for outputting
known types of objects, using FieldType.toObject() to get field values.
Then we just have to provide default codecs for all the types of objects
we produce out of the box, but people can customize with their own
codecs if they want a different representation.


-Hoss



Re: Namespaces in response (SOLR-1586)

2009-12-14 Thread Mattmann, Chris A (388J)
Hi Hoss,

On 12/14/09 3:18 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

 
 : Well, I actually would disagree. What's the point of #toInternal and
 : #toExternal then, other than to convert from the external representation to
 : an internal Lucene index representation, and then to do the opposite coming
 : out of the index?
 
 that is what they are for -- but they deal purely in string
 representations of the data itself -- they don't (and shouldn't) know/care
 whether the data is then being encapsulated in JSON, thrift, Avro, Solr XML,
 RSS, KML, etc
 
 The String limitation of toExternal is one of the reasons toObject was
 added (and the reason the BinaryResponseWriter uses toObject()).

Conceptually I think that the best approach would be to do something similar
to the functionality of #toObject, but to not call it that. #toInternal and
#toExternal are actually good names, their interface is just off (they
shouldn't return Strings).

 
 : class final which it once was). We should rename that to
 : SolrXmlResponseWriter, but it's not really generic XML (as the name
 : suggests), it's SOLR's custom (undocumented) XML schema, right? Also, since
 
 Eh... i don't know that the name suggests that it can generate generic
 XML, it generates a (particular) one to one mapping from the
 SolrQueryResponse to XML .. just like the JSONResponseWriter generates a
 one to one mapping from the SolrQueryResponse to JSON, and ditto for the
 ruby/php/python writers ... there are an infinite number of possible
 XML/JSON/Ruby/PHP/Python/etc. structures that *could* be generated from
 a SolrQueryResponse, no one has ever accused any of those response writers
 of not being flexible enough to generate a *different* type of response in
 those formats.

You may be right, but actually quite a few issues have referenced even non
XMLWriters of similar issues. See:

http://issues.apache.org/jira/browse/SOLR-1616
http://issues.apache.org/jira/browse/SOLR-358
http://issues.apache.org/jira/browse/SOLR-1555
http://issues.apache.org/jira/browse/SOLR-431
http://issues.apache.org/jira/browse/SOLR-912

 
 And practically speaking: slapping Solr in front of a response writer
 classname isn't going to make it crystal clear that it produces a solr
 specific type of .  It's going to make people think it's the
 Solr implementation of .  Solr is the prefix of enough classnames
 that eyeballs are just going to gloss over it.

Maybe, maybe not. I'm not sure the effect is to make it crystal clear as
much as it is to make it clearer. XMLWriter is totally ambiguous -- what
type of XML does it generate? I would argue SOLR response XML, hence the
SolrXmlResponseWriter.

 
 : suggests), it's SOLR's custom (undocumented) XML schema, right? Also, since
 : it's undocumented, I'd be happy to throw it together for its XML format.
 
 we actually went round and round on documenting it back in the early days
 .. frequently it was deemed self documenting enough for end users so not
 much effort was ever put into it.  there was a Jira issue to create an
 XSD, but even once we had one, no one really had any idea what to *do*
 with it...
 
 https://issues.apache.org/jira/browse/SOLR-17

I commented on SOLR-17 on what could be done with it, and I linked it to the
new issue I threw up: SOLR-1646. Both can be closed at the same time, or
even better, I can close SOLR-1646 and then work diligently on trying to get
SOLR-17 committed. Even for documentation purposes it's well worth while.

 
 
 : Would that also be welcomed? Then, we should develop an easy extension point
 : mechanism for people who want to develop their own XML response writers and
 : write their own clients (or leverage existing clients that understand that
 : XML).
 
 +1
 
 I think the crux of this would be an XML based response writer similar to the
 BinaryResponseWriter that can use a codec type system for outputting
 known types of objects, using FieldType.toObject() to get field values.
 Then we just have to provide default codecs for all the types of objects
 we produce out of the box, but people can customize with their own
 codecs if they want a different representation.

+1!

Thanks, Hoss.

Cheers,
Chris





Re: Namespaces in response (SOLR-1586)

2009-12-11 Thread Chris Hostetter
:  : I think the initial geosearch feature can start off with
:  : <str>10,20</str> for a point.
:  
:  +1.
: 
: Fundamentally, how is a string a point?

Fundamentally a string is not a point, and a point is not a string -- but
if you want to express the concept of a point in a manner that only uses very
simple primitive types, then a string containing comma separated numbers
is a pretty decent way to do it.  If you'd prefer, a pair of numbers would
work just as well...

   <arr><float>10</float><float>20</float></arr>
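
For a client the two encodings carry the same information. A minimal Java sketch, assuming a hypothetical helper class PointStrings (not part of Solr or SolrJ), that converts between the comma separated string form and a pair of numbers:

```java
// Hypothetical helper for illustration only; not a Solr or SolrJ class.
final class PointStrings {
    private PointStrings() {}

    // Parse "10,20" into {10.0, 20.0}; anything other than exactly two
    // comma separated numbers is rejected.
    static double[] parse(String text) {
        String[] parts = text.split(",", -1); // -1 keeps trailing empty fields
        if (parts.length != 2) {
            throw new IllegalArgumentException(
                    "expected two comma separated numbers: " + text);
        }
        return new double[] {
            Double.parseDouble(parts[0].trim()),
            Double.parseDouble(parts[1].trim())
        };
    }

    // Format a pair of numbers back into the comma separated string form.
    static String format(double first, double second) {
        return first + "," + second;
    }
}
```

Either way the client only ever sees primitive values it already knows how to read; whether the first number is latitude or longitude is a convention the sketch does not decide.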

:  The current XML format Solr uses was designed to be extremely simple, very
:  JSON-esque, and easily parsable by *anyone* in any language, without
:  needing special knowledge of types.
: 
: Whoah. I'm totally confused now. Why have FieldTypes then? Why not just use
: Lucene? The use case for FieldTypes is _not_ just for indexing, or querying.
: It's also for representation?

No, actually the use case for FieldTypes is entirely about the internal
logic of how Solr should deal with those fields, and how various 
operations should work on them.  FieldTypes can dictate the internal 
representation within the confines of a Lucene index, but they should not 
circumvent the contracts of the response writers in dictating what 
is/isn't a legal response.

XMLWriter.writePrim may be public, which means there is a loophole that 
plugin writers can exploit to add new tag names to the Solr XML response that
violate the contract (and no we don't have a formal XSD or DTD for our 
XML response format, but we still have a very well advertised contract) -- 
but that doesn't mean that code which ships with Solr should exploit those 
loopholes to violate that contract.  People should expect that if they use 
Solr as is without any custom code that the XMLResponseWriter won't all of
a sudden start including new, non-primitive-ish, XML tags/attributes
that weren't there before.

That's the entire point of the format as it was designed: break down 
whatever complex data might be involved in a response into easily 
digestible maps/lists of maps/lists of very primitive types that can 
easily be used in any programming language.

: allowed for a while I think), why prevent it? Allowing namespaces does _not_
: break anything. 
...
:  introducing a new 'point' concept, whether as <point> or as
:  <georss:point/>, is going to break things for people.
: 
: Show me an example, I fundamentally disagree with this.

Ok. Let's start with SolrJ then: take a look at the KnownType enum (line 
151) in XMLResponseParser...

http://svn.apache.org/viewvc/lucene/solr/trunk/src/solrj/org/apache/solr/client/solrj/impl/XMLResponseParser.java?revision=819403&view=markup

...or let's do a random google code search for solr xml lst -- check out 
ResponseContentHandler in solrpy...

http://code.google.com/p/solrpy/source/browse/trunk/solr/core.py#841

...I can't write python code to save my life, but I have pretty good idea 
what that code will do if it sees an unexpected tag.

This is how a *LOT* of Solr client libraries are implemented ... it's not
an issue of broken XML parsers freaking out about namespaces, it's an 
issue of having a long standing, heavily advertised schema for the XML 
response that promises to only ever use a handful of types.  Adding any 
new tags to this format (regardless of how easy it may be because of that 
stupid fucking public modifier on XMLWriter.writePrim) will absolutely
break things for people.
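
The failure mode can be sketched in a few lines of Java. This is only an illustration of a client that trusts a fixed set of tags; KnownTag and fromElementName are hypothetical names, and the code is not taken from SolrJ or solrpy:

```java
import java.util.Locale;

// Hypothetical whitelist of the primitive-ish tags a client was written against.
enum KnownTag {
    STR, INT, LONG, FLOAT, DOUBLE, BOOL, DATE, ARR, LST;

    // Map an XML element name to a known tag, failing hard on anything else.
    static KnownTag fromElementName(String name) {
        try {
            return valueOf(name.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // A tag such as georss:point ends up here: the XML is well formed,
            // but the client has no idea what to do with it.
            throw new IllegalStateException(
                    "unexpected tag in response: <" + name + ">");
        }
    }
}
```

A client written this way handles `str` or `arr` fine and throws as soon as it meets `georss:point`, regardless of how well its XML parser copes with namespaces.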

: And why is that? Isn't the point of SOLR to expand to use cases brought up
: by users of the system? As long as those use cases can be principally
: supported, without breaking backwards compatibility (or in that case,  if
: they do, with large blinking red text that says it), then you're shutting
: people out for 0 benefit? It's aesthetics we're talking about here.

I don't know if i'd say that's the point of Solr, but yes we should 
absolutely try to grow the capabilities of the system as new use cases 
come along.

I am 100% in agreement that the existing simple XMLResponseWriter is
not for everyone -- Historically we've tried to maintain a sense of equality
between all of the Response writers, so that they all contained the same
data just with different markup -- but there are clearly cases where it
would be nice to have a response writer that is allowed to know more
about the real structure of the data and represent it in a manner that
more closely represents its purpose.  This was the entire point behind
adding FieldType.toObject, and UUIDField w/the BinaryResponseWriter is a
good example of the model we should follow in the future.

There is a clear push for Solr to natively be able to generate responses that
incorporate more industry standard XML schemas, and i would love to see 
us start adding functionality to do that, but bastardizing the existing 
XMLResponseWriter format is not the way to do it.

Bottom Line: I am a big fat -1 on any patch to Solr that adds new xml tags 
to the output 

Re: Namespaces in response (SOLR-1586)

2009-12-11 Thread Chris Hostetter

:  themselves ... because of the back-ass-wards way we have FieldTypes write
:  their values directly to an XMLWriter or a TextWriter the idea of using an
:  object that stringifies itself as needed doesn't really apply very well
: 
: I think it's rather powerful. You insulate the following variations into 1
: single place to change them (FieldType):
: 
: * output representation
: * indexing
: * validation
: 
: To remove this from FieldType would be to strew the same functionality
: across multiple classes, which doesn't make sense IMHO.

it's a damned-if-you-do/damned-if-you-don't situation though ... you look
at it as insulating the response writers because all of the logic about
serializing data is in the FieldType, but i look at it as polluting the
FieldType with knowledge about the output formats -- there's a reason we
didn't add writeBinary to the FieldType when the BinaryResponseWriter
was added ... the toObject abstraction lets the FieldType do whatever it
wants internally, and provide its best face to the world when asked.
the ResponseWriters can then apply heuristics to decide the most
compatible type they know of to use when representing it: is it something
complex i have a codec for? no; oh well, then is it something that
implements Collection? no; oh well, then is it something that is an
instanceof Number? no; oh well, as a last resort we can stringify
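
That fallback chain can be sketched roughly as follows. Everything here is hypothetical: FallbackWriter, registerCodec and SketchPoint are made-up names, the tags only imitate Solr's primitive tags, and no XML escaping is done. It is a sketch of the heuristic, not of any existing response writer:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical value type standing in for whatever a FieldType's toObject returns.
final class SketchPoint {
    final double x;
    final double y;
    SketchPoint(double x, double y) { this.x = x; this.y = y; }
    @Override public String toString() { return x + "," + y; }
}

// Hypothetical writer: tries a codec, then Collection, then Number, then stringifies.
final class FallbackWriter {
    private final Map<Class<?>, Function<Object, String>> codecs = new HashMap<>();

    // Register a codec for one exact class (no subclass lookup in this sketch).
    <T> void registerCodec(Class<T> type, Function<T, String> codec) {
        codecs.put(type, value -> codec.apply(type.cast(value)));
    }

    String write(Object value) {
        if (value == null) {
            return "<null/>";
        }
        Function<Object, String> codec = codecs.get(value.getClass());
        if (codec != null) {                  // 1. something there is a codec for?
            return codec.apply(value);
        }
        if (value instanceof Collection) {    // 2. a Collection? write each element
            return ((Collection<?>) value).stream()
                    .map(this::write)
                    .collect(Collectors.joining("", "<arr>", "</arr>"));
        }
        if (value instanceof Number) {        // 3. a Number?
            return "<double>" + ((Number) value).doubleValue() + "</double>";
        }
        return "<str>" + value + "</str>";    // 4. last resort: stringify
    }
}
```

With no codec registered, a SketchPoint falls through to the string form; registering a codec changes its representation without the value type knowing anything about the output format.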

: In the long run, this might be nice, and +1 on getting there in the long
: run. In the short, a compromise is to allow namespacing on fields in the
: existing XmlWriter, which is allowed anyways, whether by oversight or not.

I'm sure if we look hard enough at the existing internal APIs, we can find
a way to generate completely broken XML that no DOM, SAX or pull parser
could possibly deal with cleanly -- but that doesn't mean we should do
that just because it would allow us to start outputting a bunch of metadata
that we think is useful.  breaking the (implicit) XML Schema is just as 
bad as breaking the XML itself.



-Hoss



Re: Namespaces in response (SOLR-1586)

2009-12-11 Thread Mattmann, Chris A (388J)
Hi Hoss,

 :  : I think the initial geosearch feature can start off with
 :  : <str>10,20</str> for a point.
 : 
 :  +1.
 :
 : Fundamentally, how is a string a point?
 
 Fundamentally a string is not a point, and a point is not a string -- but
 if you want to express the concept of a point in a manner that only uses very
 simple primitive types, then a string containing comma separated numbers
 is a pretty decent way to do it.  If you'd prefer, a pair of numbers would
 work just as well...
 
 <arr><float>10</float><float>20</float></arr>

I'm conflicted here. In simple semantics, sure it's just an array of
float/double numbers. A, if a string must be used a comma is probably OK, so
long as it maps to some existing known approach to represent points. I've
asked several times if there are examples. I can point to one that uses
spaces to separate the coordinates in the point (georss). What others use
comma? 

 
 :  The current XML format Solr uses was designed to be extremely simple, very
 :  JSON-esque, and easily parsable by *anyone* in any language, without
 :  needing special knowledge of types.
 :
 : Whoah. I'm totally confused now. Why have FieldTypes then? Why not just use
 : Lucene? The use case for FieldTypes is _not_ just for indexing, or querying.
 : It's also for representation?
 
 No, actually the use case for FieldTypes is entirely about the internal
 logic of how Solr should deal with those fields, and how various
 operations should work on them.  FieldTypes can dictate the internal
 representation within the confines of a Lucene index, but they should not
 circumvent the contracts of the response writers in dictating what
 is/isn't a legal response.

Well, I actually would disagree. What's the point of #toInternal and
#toExternal then, other than to convert from the external representation to
an internal Lucene index representation, and then to do the opposite coming
out of the index? 

 
 : allowed for a while I think), why prevent it? Allowing namespaces does _not_
 : break anything.
 ...
 :  introducing a new 'point' concept, whether as <point> or as
 :  <georss:point/>, is going to break things for people.
 :
 : Show me an example, I fundamentally disagree with this.
 
 Ok. Let's start with SolrJ then: take a look at the KnownType enum (line
 151) in XMLResponseParser...
 
 http://svn.apache.org/viewvc/lucene/solr/trunk/src/solrj/org/apache/solr/client/solrj/impl/XMLResponseParser.java?revision=819403&view=markup

Got it. OK, sure, well thanks for actually being able to identify somewhere
where it would be and for taking the time to provide a link. So what you are
saying is that this breaks the SolrJ and python clients and people who
develop clients to parse and read the (undocumented) SOLR response schema.

 
 ...or let's do a random google code search for solr xml lst -- check out
 ResponseContentHandler in solrpy...
 
 http://code.google.com/p/solrpy/source/browse/trunk/solr/core.py#841
 
 ...I can't write python code to save my life, but I have pretty good idea
 what that code will do if it sees an unexpected tag.

Gotcha.


 : And why is that? Isn't the point of SOLR to expand to use cases brought up
 : by users of the system? As long as those use cases can be principally
 : supported, without breaking backwards compatibility (or in that case,  if
 : they do, with large blinking red text that says it), then you're shutting
 : people out for 0 benefit? It's aesthetics we're talking about here.
 
 I don't know if i'd say that's the point of Solr, but yes we should
 absolutely try to grow the capabilities of the system as new use cases
 come along.

Well that's what I was trying to do, but all I was hearing was a lot of
hollering without any help to understand why. Thanks for being the one to
finally provide that information.

 
 I am 100% in agreement that the existing simple XMLResponseWriter is
 not for everyone -- Historically we've tried to maintain a sense of equality
 between all of the Response writers, so that they all contained the same
 data just with different markup -- but there are clearly cases where it
 would be nice to have a response writer that is allowed to know more
 about the real structure of the data and represent it in a manner that
 more closely represents its purpose.

I'd like to refactor the whole thing to be a bit less brittle, and also to
close off people that shouldn't be dealing with SOLR's XML in/out (by taking
away your favorite writePrim method and its public modifier and making the
class final which it once was). We should rename that to
SolrXmlResponseWriter, but it's not really generic XML (as the name
suggests), it's SOLR's custom (undocumented) XML schema, right? Also, since
it's undocumented, I'd be happy to throw it together for its XML format.
Would that also be welcomed? Then, we should develop an easy extension point
mechanism for people who want to develop their own XML response writers and
write their own clients (or leverage existing clients that 

Re: Namespaces in response (SOLR-1586)

2009-12-11 Thread Mattmann, Chris A (388J)
Hi Hoss,

 : I think it's rather powerful. You insulate the following variations into 1
 : single place to change them (FieldType):
 :
 : * output representation
 : * indexing
 : * validation
 :
 : To remove this from FieldType would be to strew the same functionality
 : across multiple classes, which doesn't make sense IMHO.
 
 it's a damned-if-you-do/damned-if-you-don't situation though ... you look
 at it as insulating the response writers because all of the logic about
 serializing data is in the FieldType, but i look at it as polluting the
 FieldType with knowledge about the output formats -- there's a reason we
 didn't add writeBinary to the FieldType when the BinaryResponseWriter
 was added ... the toObject abstraction lets the FieldType do whatever it
 wants internally, and provide its best face to the world when asked.
 the ResponseWriters can then apply heuristics to decide the most
 compatible type they know of to use when representing it: is it something
 complex i have a codec for? no; oh well, then is it something that
 implements Collection? no; oh well, then is it something that is an
 instanceof Number? no; oh well, as a last resort we can stringify

Sure, it's just that it's half-way on both sides right now like you said.
There's probably a middle ground. I like the insulation but I also
understand the clutter (i.e., what you're saying).

 
 : In the long run, this might be nice, and +1 on getting there in the long
 : run. In the short, a compromise is to allow namespacing on fields in the
 : existing XmlWriter, which is allowed anyways, whether by oversight or not.
 
 I'm sure if we look hard enough at the existing internal APIs, we can find
 a way to generate completely broken XML that no DOM, SAX or pull parser
 could possibly deal with cleanly -- but that doesn't mean we should do
 that just because it would allow us to start outputting a bunch of metadata
 that we think is useful.  breaking the (implicit) XML Schema is just as
 bad as breaking the XML itself.

Agreed. Let's document that (implicit) schema so loud people like me don't
keep bugging you guys when it's so obvious to you. I'm just trying to help.
I'll take an action.

Cheers,
Chris





Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Mattmann, Chris A (388J)
Hi Grant, and others,

My 2 cents (and of course I'm biased having prepared the patch):

 In SOLR-1586, the proposed patch introduces the concept that a Solr response
 can declare a namespace for part of the response (in this case, it is using
 the tags defined by georss.org to specify a point, etc.).

The patch doesn't introduce this concept -- it makes use of it.
XMLWriter#writePrim took care of that for me, see Hostetter's comment:

http://www.lucidimagination.com/search/document/be6fb7ce53c2922d/jira_create
d_solr_1592_refactor_xmlwriter_starttag_to_allow_arbitrary_attributes_to_be_
writ


Since that method is public, anyone could have done this in the past, they
just chose not to. Moreover, they chose not to in the committed source for
SOLR, but others who took SOLR, prepared their own XML response writers,
etc., may have done this same thing as well.

 
 Discussion points:
 1. If there are standard namespaces, then people can use them to do fun XML
 things

+1. This includes things like validation, strong typing (see SOLR-912 for
others who also believe that the NamedList BagOfObjects structure, while
robust, introduces type confusion when unraveling the response), and
plugging in to other tools. Imagine a GIS tool that required a
<georss:point> to be returned back somehow. You could argue XSLT could do
this, but as you note below, it's an extra step. It also _implicitly_ ties
the representation and typing of a FieldType to something that isn't really
tied to a field type at all (an XSLT file?)

 2. If we allow them, we get all of the other benefits of namespaces...

For sure -- see above for some examples.

 3. The indexing side doesn't support them, so it seems odd to put in something
 like <field name="point">55.3 27.9</field> and get back <georss:point
 name="point">55.3 27.9</georss:point>.  At the same time, it seems equally
 weird to get back <str name="point">...</str> when there is in fact more
 semantic information available about this particular field that would
 otherwise require more work by an application to make sense of.

You got it. I'm not sure why it seems weird -- the translation from
docs/fields to external representation (via response writers or field type
representation) is one of the benefits of SOLR IMHO.

 4. If we let in other namespaces, we then are opening ourselves to longer
 responses, etc.  It is also likely the case that there isn't just one
 standard.  This likely could mean slower responses, etc.

How does adding in some characters (e.g., an ns tag and an associated URL)
add anything other than noise? We're talking the difference between O(n)
versus O(n+20) here. Also it's perfectly legit IMHO to say, well if you
introduce 10,000 namespaces, well, that's on you, and be prepared for
slower client/server interactions.

 5. If people wanted them, they could just do XSLT, but that is an extra step
 too.

Yep, that's an extra step, and it's not explicit, like the patch I attached
is. I tried to take advantage of one of SOLR's extension points in the
architecture to explicitly tie a representation of a Field to its external
and internal representation (aka, the point of a FieldType, no?)
 
 An alternative is that we could refactor things a bit and allow the FieldType
 to specify the tag name instead of it being hardcoded in the writers.  This
 way people writing FieldTypes could define them.  For instance, we could have
 FieldType.getTagName() that could be overridden and clients could have tools
 for introspecting this.

This is basically what I did, right? I did an inline namespace using a
variant of #writePrim in XMLWriter (#writeCdata) and had the
FieldType#toExternal method set the tag name, which is allowed by the API.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department University of
Southern California, Los Angeles, CA 90089 USA
++




Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Grant Ingersoll
Inline...

On Dec 9, 2009, at 9:33 AM, Mattmann, Chris A (388J) wrote:

 Hi Grant, and others,
 
 My 2 cents (and of course I'm biased, having prepared the patch):
 
 In SOLR-1586, the proposed patch introduces the concept that a Solr response
 can declare a namespace for part of the response (in this case, it is using
 the tags defined by georss.org to specify a point, etc.).
 
 The patch doesn't introduce this concept -- it makes use of it.
 XMLWriter#writePrim took care of that for me, see Hostetter's comment:
 
 http://www.lucidimagination.com/search/document/be6fb7ce53c2922d/jira_created_solr_1592_refactor_xmlwriter_starttag_to_allow_arbitrary_attributes_to_be_writ
 
 
 Since that method is public, anyone could have done this in the past, they
 just chose not to. Moreover, they chose not to in the committed source for
 SOLR, but others who took SOLR, prepared their own XML response writers,
 etc., may have done this same thing as well.
 
 
 Discussion points:
 1. If there are standard namespaces, then people can use them to do fun XML
 things
 
 +1. This includes things like validation,

Yeah, but the rest of Solr's response doesn't have it, so...

 strong typing (see SOLR-912 for
 others who also believe that the NamedList BagOfObjects structure, while
 robust, introduces type confusion when unraveling the response), and
 plugging in to other tools. Imagine a GIS tool that required a
 georss:point to be returned back somehow. You could argue XSLT could do
 this, but as you note below, it's an extra step. It also _implicitly_ ties
 the representation and typing of a FieldType to something that isn't really
 tied to a field type at all (an XSLT file?)

Agreed.

 
 2. If we allow them, we get all of the other benefits of namespaces...
 
 For sure -- see above for some examples.
 
 3. The indexing side doesn't support them, so it seems odd to put in something
 like <field name="point">55.3 27.9</field> and get back <georss:point
 name="point">55.3 27.9</georss:point>.  At the same time, it seems equally
 weird to get back <str name="point">...</str> when there is in fact more
 semantic information available about this particular field that would
 otherwise require more work by an application to make sense of.
 
 You got it. I'm not sure why it seems weird -- the translation from
 docs/fields to external representation (via response writers or field type
 representation) is one of the benefits of SOLR IMHO.

It's weird b/c no XML type was specified upfront, but a type was given out on 
the back end.  It's not a show stopper or anything, just an interesting point, 
I think.

 
 4. If we let in other namespaces, we then are opening ourselves to longer
 responses, etc.  It is also likely the case that there isn't just one
 standard.  This likely could mean slower responses, etc.
 
 How does adding in some characters (e.g., an ns tag and an associated URL)
 add anything other than noise? We're talking the difference between O(n)
 versus O(n+20) here. Also it's perfectly legit IMHO to say, well if you
 introduce 10,000 namespaces, well, that's on you, and be prepared for
 slower client/server interactions.

You'd be surprised how slow XML parsing often is. Especially for larger 
responses, XML processing can be quite expensive, and most of the information 
is verbose at best.  I've seen this on a number of occasions; it is why we 
switched to a binary response format in SolrJ and why I think all clients 
should speak the binary protocol.


 
 5. If people wanted them, they could just do XSLT, but that is an extra step
 too.
 
 Yep, that's an extra step, and it's not explicit, like the patch I attached
 is. I tried to take advantage of one of SOLR's extension points in the
 architecture to explicitly tie a representation of a Field to its external
 and internal representation (aka, the point of a FieldType, no?)
 
 An alternative is that we could refactor things a bit and allow the FieldType
 to specify the tag name instead of it being hardcoded in the writers.  This
 way people writing FieldTypes could define them.  For instance, we could have
 FieldType.getTagName() that could be overridden and clients could have tools
 for introspecting this.
 
 This is basically what I did, right? I did an inline namespace using a
 variant of #writePrim in XMLWriter (#writeCdata) and had the
 FieldType#toExternal method set the tag name, which is allowed by the API.

As Hoss points out on the thread, I think the longer-term goal seems to be to 
become more agnostic of the FieldType, which would argue against my proposal.

-Grant



Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Mattmann, Chris A (388J)
Hi Grant,

My replies inline as well:

 
 Discussion points:
 1. If there are standard namespaces, then people can use them to do fun XML
 things
 
 +1. This includes things like validation,
 
 Yeah, but the rest of Solr's response doesn't have it, so...
 

You mean the rest of SOLR's default response and the components that add to
it. My point was that I can, as a user of SOLR, introduce as many inline xmlns
attributes (and thus declare an arbitrary number of namespaces) as I want;
there is nothing that precludes me from doing so.

 3. The indexing side doesn't support them, so it seems odd to put in something
 like <field name="point">55.3 27.9</field> and get back <georss:point
 name="point">55.3 27.9</georss:point>.  At the same time, it seems equally
 weird to get back <str name="point">...</str> when there is in fact more
 semantic information available about this particular field that would
 otherwise require more work by an application to make sense of.
 
 You got it. I'm not sure why it seems weird -- the translation from
 docs/fields to external representation (via response writers or field type
 representation) is one of the benefits of SOLR IMHO.
 
 It's weird b/c no XML type was specified upfront, but a type was given out on
 the back end.  It's not a show stopper or anything, just an interesting point,
 I think.

I actually disagree with this. FieldTypes, if we agree on a data type
representation, e.g., georss point format, or line format, etc., define
their XML representation. So, if we have a FieldType of type georss:point,
then a type _is_ given up front, it's just defined in the standard that
defines the field element.

Imagine if you wanted to standardize on something like dublin core, for
titles, formats, etc. SOLR expects a fairly simple XML structure (Documents,
with Fields, with attributes), but the advantage of SOLR over traditional
Lucene is that via FieldTypes, you can understand what the true type of the
field you are indexing is. In other words, we can say in a schema file that
e.g., this incoming title is DublinCore, so its field type is
solr.DublinCoreAuthor, which inside of the FieldType definition, tells us
how to go from the given representation to the index representation
(#toInternal) and subsequently tells us how to go from the index
representation to the external representation (#toExternal).

I'm not advocating for changing SOLR's input doc format for indexing -- I'm
arguing that what you guys have done is actually a great idea. Having
FieldTypes and SolrInputDocuments as separate allows each to evolve
independently of one another, but at the same time, be brought back together
for the purpose of e.g., validation (see the lat/lon validation I did in
the attached patch), response writing (for plugging into external tools),
and for representation in the Lucene index outside of plain ol' Strings.
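
The round trip described above can be sketched as follows. This is a rough Python illustration, not Solr's Java API: the class name `PointFieldType`, the method names `to_internal` and `to_external`, and the comma-separated index form are all assumptions made for the example. Only the idea of validating lat/lon on the way in and producing a space-separated, GeoRSS-style value on the way out comes from the discussion.

```python
class PointFieldType:
    """Hypothetical field type mapping an external "lat lon" value to an index form."""

    def to_internal(self, external):
        # Parse the given representation and validate it before indexing.
        lat_text, lon_text = external.split()
        lat, lon = float(lat_text), float(lon_text)
        if not -90.0 <= lat <= 90.0:
            raise ValueError("latitude out of range: %r" % lat)
        if not -180.0 <= lon <= 180.0:
            raise ValueError("longitude out of range: %r" % lon)
        # Assumed index form: a plain comma-separated string.
        return "%s,%s" % (lat, lon)

    def to_external(self, internal):
        # Convert the index form back to the space-separated external form.
        lat, lon = internal.split(",")
        return "%s %s" % (lat, lon)
```

Under these assumptions the input document format stays a plain field value, while the field type alone decides how the value is checked and how it is represented externally.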

 
 
 4. If we let in other namespaces, we then are opening ourselves to longer
 responses, etc.  It is also likely the case that there isn't just one
 standard.  This likely could mean slower responses, etc.
 
 How does adding in some characters (e.g., an ns tag and an associated URL)
 add anything other than noise? We're talking the difference between O(n)
 versus O(n+20) here. Also it's perfectly legit IMHO to say, well if you
 introduce 10,000 namespaces, well, that's on you, and be prepared for
 slower client/server interactions.
 
 You'd be surprised how slow XML parsing often is, especially for larger
 responses, XML processing can be quite expensive and most of the information
 is verbose at best.  I've seen this on a number of occasions and it is why we
 switched to a binary response format in SolrJ and why I think all clients
 should speak the binary protocol.

Sure, XML parsing can be slow, but from your point above, you guys have
standardized on using a binary request/response format in things like SolrJ,
so what does the XML have to do with this anyway, and why is performance a
concern then? In the case where people want XML, in their particular format,
it's up to them to parse (and in most cases, if they are outputting a
format, there are likely already readers/etc. that exist for that format,
where things like optimizations can be delegated to).

On the other hand, let's consider XSLT, which is a big performance hit as
well, in many cases, more of a hit than simply outputting XML with the
namespaces inline. Also, let's qualify this. I'm not saying we should make
SOLR's default response (and all its Components that add to the response) be
forced to use namespaces. However, it should definitely not be precluded.

 
 
 
 5. If people wanted them, they could just do XSLT, but that is an extra step
 too.
 
 Yep, that's an extra step, and it's not explicit, like the patch I attached
 is. I tried to take advantage of one of SOLR's extension points in the
 architecture to explicitly tie a representation of a Field to its external
 and internal representation (aka, the point of a 

Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Ramirez, Paul M (388J)
Hey All,


 1.  Namespaces are fun especially when you have some target format you are 
trying to work towards. Many target formats use namespaces extensively so 
having the ability to map to them on the back end (response) would be great. 
This does not mean that Solr would have to utilize namespaces at all and 
supporting them internally is a different issue. I think that was the spirit of 
the original patch.
 2.  From what I'm gathering this is a discussion of whether Solr supports them 
internally. Hopefully, there is a differentiation between internal/external 
namespace usage with Solr.
 3.  Why must the response dictate what is done internally within Solr?
 4.  Internally it would seem that these are just string mappings and how much 
impact would there really be to writing out the response?
 5.  If the shift is just to have them use XSLT my guess would be that would 
cause a slower response than direct mappings. This is solely my opinion as I 
have not done any tests but NamedList -> XML -> XSLT would seem logically 
slower than NamedList -> (mapped) XML.

Thanks,
Paul Ramirez


On 12/9/09 5:30 AM, Grant Ingersoll gsing...@apache.org wrote:

In SOLR-1586, the proposed patch introduces the concept that a Solr response 
can declare a namespace for part of the response (in this case, it is using the 
tags defined by georss.org to specify a point, etc.).  I'm not sure what to 
make of this.  My gut reaction says no, but I'm not a namespace expert and I 
also don't feel strongly about it.

Discussion points:
1. If there are standard namespaces, then people can use them to do fun XML 
things
2. If we allow them, we get all of the other benefits of namespaces...
3. The indexing side doesn't support them, so it seems odd to put in something 
like <field name="point">55.3 27.9</field> and get back <georss:point 
name="point">55.3 27.9</georss:point>.  At the same time, it seems equally 
weird to get back <str name="point">...</str> when there is in fact more 
semantic information available about this particular field that would otherwise 
require more work by an application to make sense of.
4. If we let in other namespaces, we then are opening ourselves to longer 
responses, etc.  It is also likely the case that there isn't just one standard. 
 This likely could mean slower responses, etc.
5. If people wanted them, they could just do XSLT, but that is an extra step 
too.

An alternative is that we could refactor things a bit and allow the FieldType 
to specify the tag name instead of it being hardcoded in the writers.  This way 
people writing FieldTypes could define them.  For instance, we could have 
FieldType.getTagName() that could be overridden and clients could have tools 
for introspecting this.

I'm not sure what effect any of this would have on downstream clients, either.

Thoughts?

-Grant



Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Yonik Seeley
My gut feeling is that we should not be introducing namespaces by default.
It introduces a new requirement of XML parsers in clients, and some
parsers would start validating by default, and going out to the web to
retrieve the referenced namespace/schema, etc.

I think the initial geosearch feature can start off with
<str>10,20</str> for a point.
If we wish to introduce a point type in the XML and binary response
writers at a later point in time, it seems like it might require a
version bump of the output format anyway, and we could go to something
simple like <point>10,20</point>.

It is worth using standards when they buy you enough, but I'm not sure
this is one of those times.
I'm sure there are standards for numeric types like int too... but
using namespaces for that seems like overkill.

But if someone wants to supply patches that can optionally enable
sticking in schema, namespaces, etc, w/o significant impact to the
default, that's OK too.  Or perhaps a custom response writer that uses
namespaces for every single type for those who want that.

-Yonik
http://www.lucidimagination.com


Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Yonik Seeley
On Wed, Dec 9, 2009 at 11:44 AM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:
 How does it introduce any new requirements? Namespaces are easily ignored by
 any XML client, as if they weren't present. In other words, unless
 the XML client has setValidating=true, then this isn't an issue.

I've run across cases where I added a schema declaration to an XML
file and then things started failing.  I think some parsers may
default to validating if they see that they can?

Namespaces are to avoid name clashes.  Solr XML is well defined and
not arbitrary... adding <point> if we wish to do so won't introduce
any clashes.

 The only difference between what you call simple above and what I've
 proposed (and correct me if I'm wrong but others have too) is that your
 <point> tag would include a namespace prefix and an xmlns attribute. What's
 the difference?

 It is worth using standards when they buy you enough, but I'm not sure
 this is one of those times.
 I'm sure there are standards for numeric types like int too... but
 using namespaces for that seems like overkill.

 There's a difference between a primitive type like int, and one like point.
 Also, it all comes down to your use case. If the only thing you're ever
 going to do with SOLR is have a SOLR client talk to it (Java, Ruby, whatever
 PL you want) then namespaces/etc. might be overkill. But why open up the
 response format then and advertise SOLR as something that provides REST-ful
 services for search?

REST-ful doesn't say anything about customizing the response format.

 If that's the case, then users consuming those
 responses need the flexibility to customize them for their use case
 (validation, plugging into external GIS tools, etc.). So, I don't agree with
 this.

What GIS tool could deal with a Solr XML response format w/o any other
knowledge of everything else in the response?
Are there some real use cases that using a namespace vs not for <point>
makes easier? (An honest question... I don't know much about GIS stuff.)

 All I've done is use what already exists. There doesn't need to be any
 patches. XmlWriter#writePrim allowed you to do this before, see:

Yeah, you can use that to output <long>false</long> too... but it will
cause certain clients to barf.

-Yonik
http://www.lucidimagination.com


Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Yonik Seeley
Should have tried this before... I just created a small XML file:

<foo>
  <bar>hi</bar>
</foo>

I pointed both firefox and IE at this file and it displays as XML fine.
I then changed the file to this:

<foo>
  <zoo:bar>hi</zoo:bar>
</foo>

That made both of them barf.
That alone makes me lean pretty strongly against using a namespace for this.

-Yonik
http://www.lucidimagination.com






Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Mattmann, Chris A (388J)
Hi Yonik,

 Should have tried this before... I just created a small XML file:
 
 <foo>
   <bar>hi</bar>
 </foo>
 
 I pointed both firefox and IE at this file and it displays as XML fine.
 I then changed the file to this:
 
 <foo>
   <zoo:bar>hi</zoo:bar>
 </foo>

Sure, of course it does. It's because that's not valid XML syntax. You have
to declare the namespace for zoo. You can do it at the top of the XML file
in the root XML tag. Or, you can do it inline (like I've done in SOLR).

Try this:

<foo>
 <zoo:bar xmlns:zoo="http://example.com/zoo">hi</zoo:bar>
</foo>
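
The difference is easy to check with any namespace-aware parser. Here is a small sketch using Python's standard `xml.etree.ElementTree`, chosen only for brevity; the `parses` and `bar_text` helpers are made up for this example.

```python
import xml.etree.ElementTree as ET

# The prefix "zoo" is used but never bound to a namespace URI.
UNDECLARED = "<foo><zoo:bar>hi</zoo:bar></foo>"
# Same element, with the namespace declared inline on the element itself.
DECLARED = '<foo><zoo:bar xmlns:zoo="http://example.com/zoo">hi</zoo:bar></foo>'


def parses(xml_text):
    """Return True if a namespace-aware parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def bar_text(xml_text):
    # Namespaced elements are addressed as {namespace-uri}local-name.
    root = ET.fromstring(xml_text)
    return root.find("{http://example.com/zoo}bar").text
```

With this parser the undeclared version is rejected as an unbound prefix, while the inline declaration is accepted and the element text is still reachable.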

Cheers,
Chris


 
 That made both of them barf.
 That alone makes me lean pretty strongly against using a namespace for this.
 
 -Yonik
 http://www.lucidimagination.com
 
 
 
 






Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Yonik Seeley
On Wed, Dec 9, 2009 at 12:40 PM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:
 <foo>
  <zoo:bar xmlns:zoo="http://example.com/zoo">hi</zoo:bar>
 </foo>

If you're forced to declare the namespace / put the URI, I'm just
afraid of what clients / XML parsers out there may start trying to
validate by default.  And I'm still trying to figure out what we gain.
 If one does want validation, it seems like we should have an
(optional) schema for the XML response as a whole?

-Yonik
http://www.lucidimagination.com


Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Chris Hostetter
: I think the initial geosearch feature can start off with
: <str>10,20</str> for a point.

+1.

The current XML format Solr uses was designed to be extremely simple, very 
JSON-esque, and easily parsable by *anyone* in any language, without 
needing special knowledge of types.  It has been heavily advertised as 
only containing a very small handful of tags, representing primitive types 
(int, long, float, date, double, str) and basic collections (arr, lst, 
doc) ... even if it never had a formal schema/DTD.  adding new tags to that 
-- namespaced or otherwise -- is a very VERY bad idea for clients who 
have come to expect that they can use very simple parsing code to access 
all the data.

introducing a new 'point' concept, whether as <point> or as 
<georss:point/>, is going to break things for people.
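
To make the concern concrete, here is a sketch of the kind of very simple client-side parsing code in question, written in Python with the standard `xml.etree.ElementTree` module. The `decode` function and its converter table are assumptions for illustration, not any real Solr client; the tag names are the ones listed above, and dates are simply kept as strings.

```python
import xml.etree.ElementTree as ET

# One converter per primitive tag; nothing else is expected in a response.
CONVERTERS = {
    "int": int,
    "long": int,
    "float": float,
    "double": float,
    "str": str,
    "date": str,
}


def decode(elem):
    """Turn a response element into plain Python data."""
    if elem.tag in CONVERTERS:
        return CONVERTERS[elem.tag](elem.text or "")
    if elem.tag == "arr":
        return [decode(child) for child in elem]
    if elem.tag in ("lst", "doc"):
        return {child.get("name"): decode(child) for child in elem}
    # A client written this way has no branch for a tag it has never seen.
    raise ValueError("unexpected tag: " + elem.tag)
```

A client like this handles every documented tag in a few lines, and fails on the first element outside that set, whether or not the new tag carries a namespace.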

As discussed with Mattmann in another thread -- some public methods in 
XMLWriter have inadvertently made it possible for plugin writers to add 
their own XML tags -- but that doesn't mean we should do it in the core 
Solr distribution.  If you write your own custom XMLWriter you aren't 
allowed to be surprised when it contains new tags, but our out of the box 
users shouldn't have to deal with such surprises.

As also discussed in that same thread: it makes a lot of sense 
in the long run to start having Response Writers that can generate more 
rich XML based responses and if there are already well defined standards 
for some of these concepts (like georss) then by all means we should 
support them -- but the existing XmlResponseWriter should NOT start 
generating new tags.

The contract for SolrQueryResponse has always said: 

 A SolrQueryResponse may contain the following types of Objects 
 generated by the SolrRequestHandler that processed the request.  
 ...  
 Other data types may be added to the SolrQueryResponse, but there is 
 no guarantee that QueryResponseWriters will be able to deal with 
 unexpected types.

...unless things have changed since the last time i looked, all of the 
out of the box response writers call toString() on any object they 
don't understand.  So the best way to move forward in a flexible manner 
seems like it would be to add a new GeoPoint object to Solr, which 
toStrings to a simple -34.56,67.89 for use by existing response writers 
as a string, but some newer smarter response writer could output it in 
some more sophisticated manner.
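
A rough sketch of that idea, in Python rather than Solr's Java. The `GeoPoint` class, both writer functions, and the tag layout are hypothetical, and the GeoRSS namespace URI is assumed to be `http://www.georss.org/georss`.

```python
from xml.sax.saxutils import escape

GEORSS_NS = "http://www.georss.org/georss"  # assumed GeoRSS namespace URI
QUOTE = {'"': "&quot;"}  # extra entity so names are safe inside attributes


class GeoPoint:
    """Hypothetical value object that stringifies to a plain "lat,lon"."""

    def __init__(self, lat, lon):
        self.lat = lat
        self.lon = lon

    def __str__(self):
        return "%s,%s" % (self.lat, self.lon)


def write_simple(name, value):
    # Existing-style writer: anything it does not understand becomes a <str>.
    return '<str name="%s">%s</str>' % (escape(name, QUOTE), escape(str(value)))


def write_georss(name, value):
    # Newer writer: understands GeoPoint, falls back to the simple form otherwise.
    if isinstance(value, GeoPoint):
        return '<georss:point xmlns:georss="%s" name="%s">%s %s</georss:point>' % (
            GEORSS_NS, escape(name, QUOTE), value.lat, value.lon)
    return write_simple(name, value)
```

The same object can then be handed to either writer: the simple one never needs to know that points exist, and only the writer that opts in produces the richer element.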


-Hoss



Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Chris Hostetter

: ...unless things have changed since the last time i looked, all of the 
: out of the box response writers call toString() on any object they 
: don't understand.  So the best way to move forward in a flexible manner 
: seems like it would be to add a new GeoPoint object to Solr, which 
: toStrings to a simple -34.56,67.89 for use by existing response writers 
: as a string, but some newer smarter response writer could output it in 
: some more sophisticated manner.

The caveat to that, now that i've skimmed SOLR-1586, is that it currently 
only applies to objects added to the SolrQueryResponse (or one of the 
containers in it) datastructure that the ResponseWriters walk 
themselves ... because of the back-ass-wards way we have FieldTypes write 
their values directly to an XMLWriter or a TextWriter the idea of using an 
object that stringifies itself as needed doesn't really apply very well 
... and it won't unless we switch all of the ResponseWriters to follow the 
BinaryResponseWriter model of using FieldType.toObject(...) to get the 
field value as an object that can be sent over the wire -- then the 
existing XmlResponseWriter, and the Text ResponseWriters, can call 
toString() on Objects they don't understand, and some 
newer/hipper/cooler response writers that understand georss can do fancier 
things with it.



-Hoss



Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Mattmann, Chris A (388J)
Hi Yonik,

 
 I've run across cases where I added a schema declaration to an XML
 file and then things started failing.  I think some parsers may
 default to validating if it sees that it can?

I've seen this too. But it won't affect the interaction we're talking about:
like I said, SOLR-1586 outputs valid XML, so this isn't an issue.

 
 Namespaces are to avoid name clashes.  Solr XML is well defined and
 not arbitrary... adding <point> if we wish to do so won't introduce
 any clashes.
 

Actually there are quite a few use cases for namespacing beyond name
clashes. Namespaces enable validation, understanding and definition for
elements (understanding units, ranges, etc.). For instance, you and I both
use the term mass, but in my domain mass refers to the planetary science
definition, while in your domain you mean the earth science one. mass does
not always mean the same thing (variation in units, representation, etc.).

See here:

http://www.w3.org/TR/2006/REC-xml-names11-20060816/

 The only difference between what you call simple above and what I've
 proposed (and correct me if I'm wrong but others have too) is that your
 <point> tag would include a namespace prefix and an xmlns attribute. What's
 the difference?
 
 It is worth using standards when they buy you enough, but I'm not sure
 this is one of those times.
 I'm sure there are standards for numeric types like int too... but
 using namespaces for that seems like overkill.
 
 There's a difference between a primitive type like int, and one like point.
 Also, it all comes down to your use case. If the only thing you're ever
 going to do with SOLR is have a SOLR client talk to it (Java, Ruby, whatever
 PL you want) then namespaces/etc. might be overkill. But why open up the
 response format then and advertise SOLR as something that provides REST-ful
 services for search?
 
 REST-ful doesn't say anything about customizing the response format.

So are you saying that the intention is not to allow customization of the
response format? Also you've released how many releases of SOLR that have
the capability to do this and now you're suddenly going to change it? I'm
sorry I disagree.

 
 If that's the case, then users consuming those
 responses need the flexibility to customize them for their use case
 (validation, plugging into external GIS tools, etc.). So, I don't agree with
 this.
 
 What GIS tool could deal with a Solr XML response format w/o any other
 knowledge of everything else in the response?
 Are there some real use cases that using a namespace vs not for <point>
 makes easier? (An honest question... I don't know much about GIS stuff.)

Using standards enables standard tool development. The alternative is for
everyone to develop their own custom tools for SOLR (or be tied to using
whatever is provided by SOLR _only_), and I don't think that's the intent. I
also don't think that's a very friendly, open strategy for users. What I'm
proposing does _not_ break backwards compatibility, anywhere. If you've got an
example, then speak up.

 
 All I've done is use what already exists. There doesn't need to be any
 patches. XmlWriter#writePrim allowed you to do this before, see:
 
 Yeah, you can use that to output <long>false</long> too... but it will
 cause certain clients to barf.

That's a ResponseWriter issue. That's not a client issue. Clients don't
arbitrarily connect to servers for which they don't speak the protocol
language.
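
To make the disagreement above concrete, here is a minimal sketch of a strict client that trusts the element name as a type hint. The `read_long` helper is hypothetical and is not part of Solr or any Solr client; it only illustrates why a value such as `<long>false</long>` could trip up a reader that assumes the tag describes the content.

```python
import xml.etree.ElementTree as ET


def read_long(xml_text):
    """Hypothetical strict reader: expects a <long> element holding an integer."""
    elem = ET.fromstring(xml_text)
    if elem.tag != "long":
        raise ValueError(f"expected <long>, got <{elem.tag}>")
    # int() raises ValueError when the text is not an integer literal.
    return int(elem.text)


print(read_long("<long>42</long>"))

try:
    read_long("<long>false</long>")
except ValueError as err:
    # The element is well-formed XML, but the typed read fails.
    print("client rejected value:", err)
```

Whether that failure counts as a writer problem or a client problem is exactly the point under debate; the sketch only shows where the failure surfaces.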

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department University of
Southern California, Los Angeles, CA 90089 USA
++




Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Mattmann, Chris A (388J)

 Any parser that does that is so broken that you should stop using it
 immediately. --wunder

Walter, totally agree here.

Cheers,
Chris






Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Ramirez, Paul M (388J)
Hey All,

I think Eric is right on here, and this is what I thought the intent of the
patch was: facilitating integration of Solr into environments where there is
not one true XML output. In addition, there shouldn't be one true JSON output
for cases where your existing code already has a way it expects the JSON. Why
not allow someone to write a JSON output that feeds directly into that tool
without having to change that tool? Its flexibility is what makes Solr so
cool, and to limit that would be a shame. None of this really has to limit the
internal representation or what the Solr community builds to support its
format, but don't unnecessarily relegate that functionality to XSLT.

--Paul


On 12/9/09 11:22 AM, Eric Pugh ep...@opensourceconnections.com wrote:



Is this the opportunity of having more than one XML output type?  I
mean, XML is meant to be a transport medium for data, and maybe moving
from a one true XML output for Solr to being able to support
multiple outputs dependent on the consumer would be useful.  I can see
it making it easier to plug Solr into environments that expect data in
certain formats, without doing an extra XSL transformation?

Eric




Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Walter Underwood
On Dec 9, 2009, at 11:11 AM, Mattmann, Chris A (388J) wrote:

 
 Any parser that does that is so broken that you should stop using it
 immediately. --wunder
 
 Walter, totally agree here.

To elaborate my position:

1. Validation is a user option. The XML spec makes that very clear. We've had 
10 years to get that right, and anyone who auto-validates is not paying 
attention. Validation is very useful when you are creating XML, rarely useful 
when reading it.

2. XML namespaces are string prefixes that use the URL syntax. They do not 
follow URI rules for anything but syntax and there is no guarantee that they 
can be resolved. In fact, an XML parser can't do anything standard with the 
result if they do resolve. Again, we've had 10 years to figure that out.

Yes, this can be confusing, but if a parser author can't figure it out, don't 
use their parser because they are already getting the simple stuff wrong.
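
Both points can be checked with a small sketch using Python's standard `xml.etree.ElementTree`, which is non-validating and treats a namespace as an opaque string. The namespace URI and the `geo` prefix below are made up for illustration and resolve to nothing, yet the document parses without any network access.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: nothing is served at this address.
DOC = """<response xmlns:geo="http://example.invalid/no-such-schema#">
  <geo:point>34.2 -118.17</geo:point>
</response>"""

# Parsing neither validates the document nor dereferences the namespace URI.
root = ET.fromstring(DOC)
point = root[0]

# The namespace is kept as a plain string inside the qualified tag name.
print(point.tag)

# Recover the namespace string from the "{uri}local" form.
namespace_uri = point.tag[1:].split("}")[0]
print(namespace_uri)
```

A consumer that wants validation or a lookup of the namespace URI has to ask for it explicitly; the parser does neither on its own.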

wunder






Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Mattmann, Chris A (388J)
Hi Yonik,

 Using standards enables standard tool development.
 
 We do use standards... lots of them :-)  Let's be a bit more specific
 though - I was asking about using a namespace for the point type by
 *default*, and in isolation (i.e. the rest of solr xml isn't
 namespaced), and if/how that made things easier?

Let's ask a different question -- how does it make things harder?

 At first blush it
 doesn't really seem to since any tool would need to deal with the Solr
 XML response in general.

I've got use cases where folks writing APIs in Javascript/Ajax are querying
SOLR (as a REST-ful web service) and elements of the response are being
dropped into a web page via DHTML. Having the ability to drop tags that
include namespaces helps out those folks because they want to have:

(a) expected representations using standards they like (GeoRSS is on the
list).

(b) understanding of the elements they are dropping in (i.e., there is one
use case where separately, after dropping in the georss:point tag, the tag
definition (e.g., via the namespace at:
http://www.w3.org/2003/01/geo/wgs84_pos#) is looked up and displayed).
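
A rough sketch of use case (b), written in Python rather than the JavaScript those clients use. The response fragment, the `doc` and `str` elements, and the binding of the `georss` prefix to the URI quoted above are assumptions for illustration, not actual Solr output.

```python
import xml.etree.ElementTree as ET

# Namespace URI quoted in the message above; the prefix binding is assumed.
GEO_NS = "http://www.w3.org/2003/01/geo/wgs84_pos#"

# Hypothetical response fragment with one namespaced point element.
RESPONSE = f"""<response>
  <doc>
    <str name="id">site-1</str>
    <georss:point xmlns:georss="{GEO_NS}">34.2 -118.17</georss:point>
  </doc>
</response>"""

root = ET.fromstring(RESPONSE)

# Find the point by namespace URI; the prefix used in the query is arbitrary.
point = root.find(".//g:point", {"g": GEO_NS})
lat, lon = (float(v) for v in point.text.split())

# The namespace travels with the element, so a client can use it as the
# address to consult for the tag definition.
definition_url = point.tag[1:point.tag.index("}")]
print(lat, lon, definition_url)
```

The client matches on the namespace URI rather than the prefix, so the element can be lifted out of the response without knowing anything else about the surrounding format.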

Cheers,
Chris
