Re: StandardCharset vs. StandardCharsets

Ulf Zibis Sat, 07 May 2011 05:19:53 -0700

Hi all,

please excuse, that I have still problems to accept this additional class, but 
+1 for the plural name.

If those charset constants are there, people _will use_ them without respect on the existing_performance disadvantages_.

A common typical use case should be: String.getBytes(...)

On small strings there is a performance lost up to 25 % using the charset variant vs. the charsetname variant. See:

http://cr.openjdk.java.net/~sherman/7040220/client
http://markmail.org/message/2tbas5skgkve52mz
http://markmail.org/thread/lnrozcbnpcl5kmzs

So I still think, we should have the standard charset names as constants in 
class j.n.c.Charset:
    public static final String UTF_8 = "UTF-8";  etc...

To avoid duplicate declarations, class sun.nio.cs.StandardCharsets should refer them. (Afterautomatic build generation of s.n.c.StandardCharsets.java generation replace "UTF-8" byjava.nio.charset.Charset.UTF_8 etc...)

If I understand right, the reasonable for this j.n.c.StandardCharsets class is to avoid too muchclasses to be loaded while booting the VM.

(1.) This could easily be achieved by lazy initialized static final methods:j.n.c.Charset.charsetUTF_8() etc... or more short j.n.c.Charset.UTF_8() etc... (unconventionalupper-case method name would indicate that a constant is returned). Aside it would remember that thereturned constants are platform vendor implementation dependent.

So there is no need for an additional class j.n.c.StandardCharsets to be 
maintained and loaded.

(2.) Alternatively/Additionally ...

In ***Bug 100098* <https://bugs.openjdk.java.net/show_bug.cgi?id=100098> - Make sun.nio.cs.* charsetobjects light-weight ** I have demonstrated that there is no need to have an *individual class foreach charset*, so we won't need to care about 6 charset classes to be loaded during VM boot.Besides the lookup of a charset should be much faster then before, as it doesn't anymore need aClass.forName(...) call, as you can see there, in my FastCharsetProvider class.This change seems to be too invasive in the current state of JDK-7, but in the meantime we couldhave something (which demonstrates the advantage to have the canonical names as constant) like:


package sun.nio.cs;
...
import static java.nio.charset.Charset.*;
import static sun.nio.cs.StandardCharsets.*;
public class FastCharsetProvider extends CharsetProvider {
    ...

    private Charset lookup(String charsetName) {

        String csn = canonicalize(toLower(charsetName));

        // Check cache first
        Charset cs = cache.get(csn);
        if (cs != null)
            return cs;

        StandardCharset: {
            switch (charsetName) {
                case US_ASCII : cs = new StandardCharset(US_ASCII, 
aliases_US_ASCII); break;
                case ISO_8859_1 : cs = new StandardCharset(ISO_8859_1, 
aliases_ISO_8859_1); break;
                case UTF_8 : cs = new StandardCharset(UTF_8, aliases_UTF_8); 
break;
                case UTF_16 : cs = return new StandardCharset(UTF_16, 
aliases_UTF_16); break;
                case UTF_16BE : cs = new StandardCharset(UTF_16BE, 
aliases_UTF_16BE); break;
                case UTF_16LE : cs = new StandardCharset(UTF_16LE, 
aliases_UTF_16LE); break;
                default : break StandardCharset;
            }
            cache.put(csn, cs);
            return cs;
        }

        // Do we even support this charset?
        String cln = classMap.get(csn);
        if (cln == null)
            return null;

        // Instantiate the charset and cache it
        ...
    }

    private final class StandardCharset extends Charset {
        private StandardCharset(String canonicalName, String[] aliases) {
            super(canonicalName, aliases);
        }
        newDecoder() {
            switch (name()) {
                case US_ASCII : return new sun.nio.cs.US_ASCII.Decoder();
                case ISO_8859_1 : return new sun.nio.cs.ISO_8859_1.Decoder();
                case UTF_8 : return new sun.nio.cs.UTF_8.Decoder();
                case UTF_16 : return new sun.nio.cs.UTF_16.Decoder();
                case UTF_16BE : return new sun.nio.cs.UTF_16BE.Decoder();
                case UTF_16LE : return new sun.nio.cs.UTF_16LE.Decoder();
                default : return Charset.defaultCharset().newDecoder();
        }
        newEncoder() { ... }
        ...
    }
    ...

}

So I tend to prefer the original request from 4884238 (have the canonical names as constants), asthe lookup via Charset.forName(...) then could be very fast compared to the anyway following heavyde/encoding work.


In that case we additionally could avoid such weird constructs
(potentially superfluously forces the UTF-8 charset class to be loaded):
 114     private ZipCoder(Charset cs) {
 115         this.cs = cs;
 116         this.isUTF8 = cs.name().equals(StandardCharset.UTF_8.name());
 117     }

Instead we could have:
 116         this.isUTF8 = cs.name().equals(Charset.UTF_8);


-Ulf


Am 02.05.2011 23:16, schrieb mark.reinh...@oracle.com:

2011/4/29 16:28 -0700, mike.dui...@oracle.com:

Hi Mark;

I'm still not on your wavelength on this issue. For two reasons:

- Collections, Arrays, Channels, Objects aren't typically used for
holding constants. They are either factories or collections of static
utility methods. Only Collections actually defines any constants and
those probably should have been put on the respective interfaces.

The constants in the Collections class weren't put into the related
interfaces because that would have polluted any implementing class or
subinterface with that constant name.  Constants in interfaces are
almost always a bad idea.

The closest example seems to be PosixFilePermission (an enum)
vs. PosixFilePermissions (static utility methods). The one difference
is that PosixFilePermission is exhaustive, no other permissions are
possible, vs the standard charsets where they represent only a subset
of the available charsets. AFAIK even ME distributions have more than
just the standard charsets.

For the standard charsets we've got no methods, only constants. Is
Charsets preferred only because we want to avoid putting these
constants on Charset itself for the previously cited performance
reasons?

To be clear, I'm not objecting to the prefix "Standard" -- I'm just
arguing in favor of the suffix "s".

- The standard charset constants do seem like an enum. The enumeration
they belong to is easily named and part of the platform
specification.

Conceptually, I agree.

                Putting these definitions all into one class makes
membership obvious--the standard charsets are the ones in this
class.

Yep.

        Charsets not defined in this class are not standard and thus not
guaranteed to be available. (which is an interesting point, though the
standard charsets are now visible in the platform we still don't
provide a way to test whether a given charset is a member of the set of
standard charsets).

That is an omission, but I wonder if anyone cares?  (A possible fix would
be to define StandardCharsets.ALL as a set containing all the singletons
defined in the class.)

Barring history and implementation details I don't see why we wouldn't
be defining these constants as an enum implementing Charset the same
way as we've defined the jsr203 Standard* enums. The enum members would
have more implementation than the jsr203 Standard* enums, ie. more than
identity but almost all of that is constant data itself.

That would require Charset to have been defined as an interface, which
in turn would've added substantial complexity to the original design.
At any rate, it's water under the bridge at this point.

What makes these constants *not* an enumeration?

If I have the enum type

     enum Foo { BAR, BAZ }

then the following assignment works:

     Foo x = Foo.BAR;

A statement of that form does not work with your StandardCharset class:

     StandardCharset x = StandardCharset.US_ASCII;

That's because StandardCharset is a final class and it does not extend
Charset, nor any other class or interface except Object.  It's just a
holder for a bunch of convenient constants, like java.util.Arrays and
j.u.Collections, etc.  The only difference is that it defines some
static constant values but no static methods.

In short, the hallmark of an enum type is that its members all have the
same type, and that type is the enum type itself.  This is true whether
you use the fancy enum construct from Java 5 or the older hand-coded enum
pattern.  Your StandardCharset class is not an enum type.

In order to fix this you could change StandardCharset to extend Charset,
but then you'd have to jump through some hoops to make the standard
charset implementation classes extend StandardCharset rather than plain
Charset, and you'd have to be careful not to allow anyone to extend
StandardCharset.  Offhand this seems to require moving the standard
charset implementations into the java.nio.charset package itself.

That's an awful lot of effort for just a handful of constants.

I'm not trying to be difficult on this, honestly.

I understand.  Me neither.

- Mark

Re: StandardCharset vs. StandardCharsets

Reply via email to