: For XML, I think trusting the XML parser, and not the servlet
: container is a better way to go.
: That means handing the XML parser an InputStream instead of a Reader.
you mean if there is no charset in the content-type? ... yeah, that was
what i (think i) was suggesting as far as XML goes, tr
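
A minimal sketch of the InputStream approach suggested above, using only the JDK's built-in DOM parser. The class and method names (XmlStreamDemo, rootText) are made up for illustration and are not part of Solr. The point is that the parser receives raw bytes and picks the encoding from the XML declaration itself, so nothing upstream has to guess a charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, not Solr code.
class XmlStreamDemo {

    // Parse raw bytes and return the text of the root element.
    // No Reader is involved, so the parser decides how to decode the bytes.
    static String rootText(InputStream in) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(in);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // A document that declares ISO-8859-1 and is really encoded that way.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><q>\u00EA</q>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);

        String text = rootText(new ByteArrayInputStream(latin1));
        // Print the code point rather than the character, to stay
        // independent of the console encoding.
        System.out.println(Integer.toHexString(text.codePointAt(0)));
    }
}
```

If the same bytes were wrapped in a Reader with the wrong charset first, the parser would never see the declaration's encoding take effect, which is the failure mode being discussed.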
On 2/1/07 6:00 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> That may be, but Solr was only publicly available for 9 months before we
> had someone running into confusion because they were trying to post an XML
> file that wasn't UTF-8 :)
>
> http://www.nabble.com/wana-use-CJKAnalyzer-tf
Some standalone tests for charset handling would be nice... something
that we could use to test the major servlet containers w/ Solr before
finalizing a release.
If someone is having problems with international chars, they could
also run the tests against their particular server.
-Yonik
On 2/1/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
...the only real question in my mind is what to do if user supplied data
has *NO* charset information of any kind ... for XML the spec seems very
clear that in that case you test for UTF-8 or UTF-16 ... but for arbitrary
streams of character d
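
A rough sketch of the "test for UTF-8 or UTF-16" step for input that carries no charset information at all. It only looks at the byte order mark and falls back to UTF-8. It is a simplification of the autodetection described in the XML spec (it ignores UTF-32 and BOM-less UTF-16, for example), and the names BomSniffer and guessCharset are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Solr code.
class BomSniffer {

    // Guess a charset from the first bytes of a stream.
    // Assumption: anything without a recognised BOM is treated as UTF-8.
    static Charset guessCharset(byte[] head) {
        int b0 = head.length > 0 ? head[0] & 0xFF : -1;
        int b1 = head.length > 1 ? head[1] & 0xFF : -1;
        int b2 = head.length > 2 ? head[2] & 0xFF : -1;

        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            return StandardCharsets.UTF_8;      // UTF-8 BOM
        }
        if (b0 == 0xFE && b1 == 0xFF) {
            return StandardCharsets.UTF_16BE;   // big-endian BOM
        }
        if (b0 == 0xFF && b1 == 0xFE) {
            return StandardCharsets.UTF_16LE;   // little-endian BOM
        }
        return StandardCharsets.UTF_8;          // no BOM: default
    }

    public static void main(String[] args) {
        byte[] utf16be = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x3C};
        byte[] plain = {0x3C, 0x3F, 0x78, 0x6D};   // "<?xm", no BOM
        System.out.println(guessCharset(utf16be));
        System.out.println(guessCharset(plain));
    }
}
```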
: The XML spec says that XML parsers are only required to support
: UTF-8, UTF-16, ISO 8859-1, and US-ASCII. If you use a different
: encoding for XML, there is no guarantee that a conforming parser
: will accept it.
there may not be a guarantee -- but shouldn't we at least try to respect
the clie
On 2/1/07 3:18 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:
>
> As for XML, or any other format a user might POST to solr (or ask solr
> to fetch from a remote source) what possible reason would we have to only
> support UTF-8? .. why do you suggest that the XML standard "specify
> UTF-8, [s
: > Solr, in my opinion, shouldn't have the string "UTF-8" hardcoded in it
: > anywhere -- not even in the example config: new users shouldn't need to
: > know about or have any special solrconfig options that must be (un)set to get
: > Solr to use their servlet container / system default charset.
:
On 2/1/07 2:53 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> Solr, in my opinion, shouldn't have the string "UTF-8" hardcoded in it
> anywhere -- not even in the example config: new users shouldn't need to
> know about or have any special solrconfig options that must be (un)set to get
> Solr to
: I am only suggesting it for GET requests where the params are pulled
: off the query string. Apparently, UTF-8 is the *only* ok URL encoding
:
: http://www.w3.org/International/O-URL-code.html
:
: It is strange, that resin and tomcat don't observe this unless it is
: specified as the default enc
: Let's not make this complicated for situations that we've never
: seen in practice. Java is a Unicode language and always has been.
: Anyone running a Java system with a Shift-JIS default should already
: know the pitfalls, and know them much better than us (and I know a
: lot about Shift-JIS).
: If we can do something small that makes the most normal cases work
: even if the container is not configured, that seems good.
but how do we know the user wants what we consider "normal cases" to
work? ... if every servlet container lets you configure your default
charset differently, we hav
Let's not make this complicated for situations that we've never
seen in practice. Java is a Unicode language and always has been.
Anyone running a Java system with a Shift-JIS default should already
know the pitfalls, and know them much better than us (and I know a
lot about Shift-JIS).
The URI sp
: If we can do something small that makes the most normal cases work
: even if the container is not configured, that seems good.
but how do we know the user wants what we consider "normal cases" to
work? ... if every servlet container lets you configure your default
charset differently, we have
: > : > request.setCharacterEncoding ("utf-8")
: > ...my reading of the servlet spec was that request.setCharacterEncoding
: > only impacted request *body* data, not the URL.
: > According to the javadocs for it, using it also means that if the client
: > is well behaved and *does* set a charse
it seems like every servlet container has some way of configuring the
default, so we should just rely on that and not add our own default
I agree, except that in the world of first time (and even seasoned)
web-app/system developers/maintainers, we don't always set things up
properly! or even kn
On 2/1/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: > should we add:
: > request.setCharacterEncoding ("utf-8")
: > to GET requests in StandardRequestParser?
:
: Perhaps. I wonder if there's any performance impact, and if it fixes
: Tomcat's default of latin1 too.
see my comments in the r
: > should we add:
: > request.setCharacterEncoding ("utf-8")
: > to GET requests in StandardRequestParser?
:
: Perhaps. I wonder if there's any performance impact, and if it fixes
: Tomcat's default of latin1 too.
see my comments in the related thread about POST...
http://www.nabble.com/chars
On 2/1/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
should we add:
request.setCharacterEncoding ("utf-8")
to GET requests in StandardRequestParser?
Perhaps. I wonder if there's any performance impact, and if it fixes
Tomcat's default of latin1 too.
-Yonik
should we add:
request.setCharacterEncoding ("utf-8")
to GET requests in StandardRequestParser?
FYI, I talked to Caucho, and for params in the query string of a URI
they use the charset of the request (which defaults to latin1). It
will likely be fixed in the 3.1 line.
They indicated that setting the charset before asking for the
parameters would also work:
request.setCharacterEncoding ("u
On 2/1/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
I just tried this on two systems... it worked on one (I got the ê) and
the other I get ê -- both running resin 3.0.21
A co-worker informed me that adding a character-encoding attribute to
the web-app tag in web.xml will force a charset if not
I just tried this on two systems... it worked on one (I got the ê) and
the other I get ê -- both running resin 3.0.21
The one that works has http://securityfilter.sourceforge.net/ applied.
I'll look into what securityfilter is doing... it may be setting
something explicitly
So, we've conquered UTF-8 input in URLs for Jetty and Tomcat, so how
about Resin?
Right now, I can't get Resin 3.0.22 to see an e with a circumflex via
the following:
curl -i 'http://localhost:8983/solr/select?q=%C3%AA&echoParams=explicit'
-Yonik
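
To illustrate what seems to be going on with the q=%C3%AA query above, here is a small standalone sketch using only java.net.URLDecoder. It is not how any particular container is implemented; it just shows that decoding the same percent-escaped bytes as UTF-8 gives one character (ê), while decoding them as ISO-8859-1 (latin1) gives the two characters that show up as "ê". The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.stream.Collectors;

// Hypothetical demo class, not Solr or container code.
class QueryCharsetDemo {

    // Decode a percent-encoded query value with the given charset name.
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
    }

    // Render a string as space-separated hex code points, so the result
    // does not depend on the console encoding.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(Integer::toHexString)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String raw = "%C3%AA";  // the q parameter from the curl command above

        System.out.println(codePoints(decode(raw, "UTF-8")));       // ea
        System.out.println(codePoints(decode(raw, "ISO-8859-1")));  // c3 aa
    }
}
```

A container that defaults the request charset to latin1 when reading query-string params would produce the second result, which matches the symptom reported in this thread.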