Re: resin and UTF-8 in URLs

2007-02-02 Thread Chris Hostetter
: For XML, I think trusting the XML parser, and not the servlet : container is a better way to go. : That means handing the XML parser an InputStream instead of a Reader. you mean if there is no charset in the content-type? ... yeah, that was what i (think i) was suggesting as far as XML goes, tr
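A minimal sketch of the InputStream-versus-Reader point, using only the JDK's built-in DOM parser. The class and method names are hypothetical and not taken from Solr. It assumes a UTF-8 document and a caller that wraps the bytes in a Reader using a latin1 default, as a container might:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; none of these names come from Solr.
final class XmlStreamVsReader {

    // A UTF-8 encoded document whose text content is U+00EA (e with circumflex).
    static byte[] sampleDocument() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><doc>\u00EA</doc>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    // The parser sees raw bytes and picks the encoding from the XML declaration.
    static String parseFromStream(byte[] bytes) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(bytes))
                          .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // The caller decodes the bytes first; the parser only ever sees characters,
    // so a wrong guess by the caller cannot be corrected by the declaration.
    static String parseFromReader(byte[] bytes, Charset assumed) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), assumed);
            return builder.parse(new InputSource(reader))
                          .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = sampleDocument();
        // Length 1: the parser decoded the bytes itself.
        System.out.println(parseFromStream(doc).length());
        // Length 2: the latin1 Reader turned the two UTF-8 bytes into two characters.
        System.out.println(parseFromReader(doc, StandardCharsets.ISO_8859_1).length());
    }
}
```

When the Content-Type does carry a charset, passing a Reader built with that charset gives the same result as the stream; the difference only shows up when the caller has to guess.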

Re: resin and UTF-8 in URLs

2007-02-02 Thread Walter Underwood
On 2/1/07 6:00 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote: > That may be, but Solr was only publicly available for 9 months before we > had someone running into confusion because they were trying to post an XML > file that wasn't UTF-8 :) > > http://www.nabble.com/wana-use-CJKAnalyzer-tf

Re: resin and UTF-8 in URLs

2007-02-02 Thread Yonik Seeley
Some standalone tests for charset handling would be nice... something that we could use to test the major servlet containers w/ Solr before finalizing a release. If someone is having problems with international chars, they could also run the tests against their particular server. -Yonik

Re: resin and UTF-8 in URLs

2007-02-02 Thread Yonik Seeley
On 2/1/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: ...the only real question in my mind is what to do if user supplied data has *NO* charset information of any kind ... for XML the spec seems very clear that in that case you test for UTF-8 or UTF-16 ... but for arbitrary streams of character d
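For the XML case with no charset information at all, the "test for UTF-8 or UTF-16" step can be sketched roughly as below. This is a simplified illustration loosely following the autodetection appendix of the XML 1.0 spec, not code from Solr or from any parser; the class name is hypothetical, and UTF-32 and EBCDIC detection are deliberately left out:

```java
// Hypothetical helper; a real XML parser performs this detection internally.
final class XmlEncodingSniffer {

    // Guess the encoding family from the first bytes of an XML entity.
    static String guess(byte[] b) {
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";          // UTF-8 byte order mark
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";             // big-endian byte order mark
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";             // little-endian byte order mark
        if (startsWith(b, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE"; // "<?" without a BOM
        if (startsWith(b, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE"; // "<?" without a BOM
        return "UTF-8"; // assumed default when nothing else matches
    }

    // True if the array begins with the given unsigned byte values.
    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16le = "<?xml version=\"1.0\"?><a/>"
                .getBytes(java.nio.charset.StandardCharsets.UTF_16LE);
        System.out.println(guess(utf16le)); // UTF-16LE
    }
}
```

No comparable signature exists for arbitrary character streams, which is why that case stays an open question in the thread.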

Re: resin and UTF-8 in URLs

2007-02-01 Thread Chris Hostetter
: The XML spec says that XML parsers are only required to support : UTF-8, UTF-16, ISO 8859-1, and US-ASCII. If you use a different : encoding for XML, there is no guarantee that a conforming parser : will accept it. there may not be a guarantee -- but shouldn't we at least try to respect the clie

Re: resin and UTF-8 in URLs

2007-02-01 Thread Walter Underwood
On 2/1/07 3:18 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote: > > As for XML, or any other format a user might POST to solr (or ask solr > to fetch from a remote source) what possible reason would we have for only > supporting UTF-8? .. why do you suggest that the XML standard "specify > UTF-8, [s

Re: resin and UTF-8 in URLs

2007-02-01 Thread Chris Hostetter
: > Solr, in my opinion, shouldn't have the string "UTF-8" hardcoded in it : > anywhere -- not even in the example config: new users shouldn't need to : > know about any special solrconfig options that must be (un)set to get : > Solr to use their servlet container / system default charset. :

Re: resin and UTF-8 in URLs

2007-02-01 Thread Walter Underwood
On 2/1/07 2:53 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote: > Solr, in my opinion, shouldn't have the string "UTF-8" hardcoded in it > anywhere -- not even in the example config: new users shouldn't need to > know about any special solrconfig options that must be (un)set to get > Solr to

Re: resin and UTF-8 in URLs

2007-02-01 Thread Chris Hostetter
: I am only suggesting it for GET requests where the params are pulled : off the query string. Apparently, UTF-8 is the *only* ok URL encoding : : http://www.w3.org/International/O-URL-code.html : : It is strange, that resin and tomcat don't observe this unless it is : specified as the default enc

Re: resin and UTF-8 in URLs

2007-02-01 Thread Chris Hostetter
: Let's not make this complicated for situations that we've never : seen in practice. Java is a Unicode language and always has been. : Anyone running a Java system with a Shift-JIS default should already : know the pitfalls, and know them much better than us (and I know a : lot about Shift-JIS).

Re: resin and UTF-8 in URLs

2007-02-01 Thread Ryan McKinley
: If we can do something small that makes the most normal cases work : even if the container is not configured, that seems good. but how do we know the user wants what we consider a "normal case" to work? ... if every servlet container lets you configure your default charset differently, we hav

Re: resin and UTF-8 in URLs

2007-02-01 Thread Walter Underwood
Let's not make this complicated for situations that we've never seen in practice. Java is a Unicode language and always has been. Anyone running a Java system with a Shift-JIS default should already know the pitfalls, and know them much better than us (and I know a lot about Shift-JIS). The URI sp

Re: resin and UTF-8 in URLs

2007-02-01 Thread Chris Hostetter
: If we can do something small that makes the most normal cases work : even if the container is not configured, that seems good. but how do we know the user wants what we consider a "normal case" to work? ... if every servlet container lets you configure your default charset differently, we have

Re: resin and UTF-8 in URLs

2007-02-01 Thread Chris Hostetter
: > : > request.setCharacterEncoding ("utf-8") : > ...my reading of the servlet spec was that request.setCharacterEncoding : > only impacted request *body* data, not the URL. : > According to the javadocs for it, using it also means that if the client : > is well behaved and *does* set a charse

Re: resin and UTF-8 in URLs

2007-02-01 Thread Ryan McKinley
it seems like every servlet container has some way of configuring the default, so we should just rely on that and not add our own default I agree, except that in the world of first time (and even seasoned) web-app/system developers/maintainers, we don't always set things up properly! or even kn

Re: resin and UTF-8 in URLs

2007-02-01 Thread Yonik Seeley
On 2/1/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: : > should we add: : > request.setCharacterEncoding ("utf-8") : > to GET requests in StandardRequestParser? : : Perhaps. I wonder if there's any performance impact, and if it fixes : Tomcat's default of latin1 too. see my comments in the r

Re: resin and UTF-8 in URLs

2007-02-01 Thread Chris Hostetter
: > should we add: : > request.setCharacterEncoding ("utf-8") : > to GET requests in StandardRequestParser? : : Perhaps. I wonder if there's any performance impact, and if it fixes : Tomcat's default of latin1 too. see my comments in the related thread about POST... http://www.nabble.com/chars

Re: resin and UTF-8 in URLs

2007-02-01 Thread Yonik Seeley
On 2/1/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: should we add: request.setCharacterEncoding ("utf-8") to GET requests in StandardRequestParser? Perhaps. I wonder if there's any performance impact, and if it fixes Tomcat's default of latin1 too. -Yonik

Re: resin and UTF-8 in URLs

2007-02-01 Thread Ryan McKinley
should we add: request.setCharacterEncoding ("utf-8") to GET requests in StandardRequestParser?

Re: resin and UTF-8 in URLs

2007-02-01 Thread Yonik Seeley
FYI, I talked to Caucho, and for params in the query string of a URI they use the charset of the request (which defaults to latin1). It will likely be fixed in the 3.1 line. They indicated that setting the charset before asking for the parameters would also work: request.setCharacterEncoding ("u
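The behavior Caucho describes (query-string params decoded with the request charset, falling back to latin1) can be imitated outside a container. The sketch below is not Resin or Solr code: the class, the `parse` helper, and the fallback constant are hypothetical stand-ins for the container's request encoding, which `request.setCharacterEncoding` would set in a real servlet:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a container's query-string handling.
final class QueryParamSketch {

    // Assumed container fallback when no request charset has been set.
    static final String CONTAINER_DEFAULT = "ISO-8859-1";

    // Split a raw query string and percent-decode names and values with
    // the request charset, or with the fallback if none was set.
    static Map<String, String> parse(String query, String requestCharset) {
        String charset = (requestCharset != null) ? requestCharset : CONTAINER_DEFAULT;
        Map<String, String> params = new LinkedHashMap<>();
        try {
            for (String pair : query.split("&")) {
                if (pair.isEmpty()) continue;
                int eq = pair.indexOf('=');
                String name = (eq < 0) ? pair : pair.substring(0, eq);
                String value = (eq < 0) ? "" : pair.substring(eq + 1);
                params.put(URLDecoder.decode(name, charset), URLDecoder.decode(value, charset));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
        return params;
    }

    public static void main(String[] args) {
        String query = "q=%C3%AA&echoParams=explicit";
        // No charset set: the latin1 fallback yields two characters for q.
        System.out.println(parse(query, null).get("q").length());
        // Charset set to UTF-8 before reading params: one character, U+00EA.
        System.out.println(parse(query, "UTF-8").get("q").length());
    }
}
```

The ordering matters in the same way as in the quoted advice: the charset has to be chosen before the parameters are first decoded.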

Re: resin and UTF-8 in URLs

2007-01-31 Thread Yonik Seeley
On 2/1/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: I just tried this on two systems... it worked on one (I got the ê) and the other I get Ãª -- both running resin 3.0.21 A co-worker informed me that adding a character-encoding attribute to the web-app tag in web.xml will force a charset if not

Re: resin and UTF-8 in URLs

2007-01-31 Thread Ryan McKinley
I just tried this on two systems... it worked on one (I got the ê) and the other I get Ãª -- both running resin 3.0.21 The one that works has http://securityfilter.sourceforge.net/ applied. I'll look into what securityfilter is doing... it may be setting something explicitly

resin and UTF-8 in URLs

2007-01-31 Thread Yonik Seeley
So, we've conquered UTF-8 input in URLs for Jetty and Tomcat, so how about Resin? Right now, I can't get Resin 3.0.22 to see an e with a circumflex via the following: curl -i 'http://localhost:8983/solr/select?q=%C3%AA&echoParams=explicit' -Yonik
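For reference, `%C3%AA` in the curl command is the UTF-8 byte sequence for ê (U+00EA). A small sketch with the JDK's `URLDecoder` shows what a server sees depending on the charset it assumes; the class and helper names are hypothetical:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class showing one escape sequence under two charsets.
final class PercentDecodeDemo {

    // Percent-decode a raw query value using the named charset.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String raw = "%C3%AA";
        // As UTF-8, the two bytes form one character, U+00EA.
        System.out.println(decode(raw, "UTF-8").length());      // 1
        // As latin1, each byte becomes its own character, U+00C3 then U+00AA.
        System.out.println(decode(raw, "ISO-8859-1").length()); // 2
    }
}
```

A container that fails this test would typically echo the two-character form back in the `q` parameter.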