On Tue, Jul 5, 2011 at 4:32 PM, Alessio Stalla <alessiosta...@gmail.com> wrote:
> On 5 Lug, 18:49, Ken Wesson <kwess...@gmail.com> wrote:
>> 1. A too-large string literal should have a specific error message,
>> rather than generate a misleading one suggesting a different type of
>> problem.
>
> There is no such thing as a too-large string literal in a class file.

That's not what Patrick just said.

>> 2. The limit should not be different from that on String objects in
>> general, namely 2147483647 characters which nobody is likely to hit
>> unless they mistakenly call read-string on that 1080p Avatar blu-ray
>> rip .mkv they aren't legally supposed to possess.
>
> That's a limitation imposed by the Java class file format.

And therefore a bug in the Java class file format, which should allow
any size String that the runtime allows. Using 2 bytes instead of 4
bytes for the length field, as you claim they did, seems to be the
specific error. One would have thought that Java of all languages
would have learned from the Y2K debacle and near-miss with
cyber-armageddon, but limiting a field to 2 of something instead of 4
out of a misguided perception that space was at a premium was exactly
what caused that, too!
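
To make the 2-byte limit concrete, here is a minimal sketch. It does not touch the class file writer at all. It relies on java.io.DataOutputStream.writeUTF, which also stores a 2-byte length followed by modified UTF-8 bytes and so runs into the same 65535-byte ceiling. The class name Utf8LengthLimit and its helper methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

// Hypothetical demo class; not part of Clojure or the JDK.
final class Utf8LengthLimit {
    // Largest value an unsigned 2-byte length field can hold.
    static final int MAX_U2 = 0xFFFF; // 65535

    // Builds a string of n copies of c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    // True if s can be written with a 2-byte length prefix, i.e. its
    // modified UTF-8 encoding is at most 65535 bytes long.
    static boolean fitsInTwoByteLength(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s); // 2-byte length, then the encoded bytes
            return true;
        } catch (UTFDataFormatException tooLong) {
            return false; // encoded form exceeds 65535 bytes
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory stream
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsInTwoByteLength(repeat('a', MAX_U2)));     // true
        System.out.println(fitsInTwoByteLength(repeat('a', MAX_U2 + 1))); // false
    }
}
```

Note that the limit is in encoded bytes, not characters, so a string of non-ASCII characters hits it well before 65535 characters.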

>> 3. Though both of the above bugs are in Oracle's Java implementation,
>
> By the above, 1. is a Clojure bug and 2. is not a bug at all.

Oh, 2 is a bug alright. By your definition, Y2K bugs in a piece of
software would also not be bugs. The users of such software would beg
to differ.

>> it would seem to be a bug in Clojure's compiler if it is trying to
>> make the entire source code of a namespace into a string *literal* in
>> dynamically-generated bytecode somewhere rather than a string
>> *object*.
>
> Actually it seems it's the IDE, rather than Clojure, that is
> evaluating a form containing such a big literal. Since Clojure has no
> interpreter, it needs to compile that form.

The same problem has been reported from multiple IDEs, so it seems to
be a problem with eval and/or load-file. The question is not why they
might be using String *objects* that exceed 64K, since they'll need to
use Strings as large as the file gets*. It's why they'd *generate
bytecode* containing String *literals* that large.

And it's not the IDEs that generate bytecode; it's
clojure.lang.Compiler.java that generates bytecode in this scenario.

* There is a way to reduce the size requirements; crudely, line-seq
could be used to implement a lazy seq of top-level forms built by
consuming lines until delimiters are balanced and then emitting a new
form string, then evaluating these forms one by one. This works with
typical source files that have short individual top-level forms and
have at least 1 line break between any two such and would allow
consuming multi-gig source files if anyone ever had need for such a
thing (I'd hope never to see it unless it was machine-generated for
some purpose). Less crudely, a reader for files could be implemented
that didn't just slurp the file and call read-string on it but instead
read from an IO stream and emitted a seq of top-level forms converted
already into reader-output data structures (but unevaluated). In fact,
read-string could then be implemented in terms of this and a
StringInputStream whose implementation is left as an exercise for the
reader but which ought to be nearly trivial.
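
A rough sketch of the crude, line-based variant, written in Java rather than Clojure and deliberately naive: it counts brackets without understanding string literals, comments, or character literals, so a stray paren inside a string would fool it. The class TopLevelFormReader and its methods are hypothetical names, not anything in Clojure's reader. It hands back one form's text at a time, so only the current form needs to be in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch; Clojure's real reader (LispReader) works differently.
final class TopLevelFormReader {
    private final BufferedReader in;

    TopLevelFormReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    // Net nesting change contributed by one line. Crude: every bracket
    // counts, even inside strings or comments.
    static int depthDelta(String line) {
        int delta = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '(' || c == '[' || c == '{') delta++;
            else if (c == ')' || c == ']' || c == '}') delta--;
        }
        return delta;
    }

    // Returns the text of the next top-level form, or null at end of input.
    String nextForm() {
        StringBuilder form = new StringBuilder();
        int depth = 0;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (form.length() == 0 && line.trim().isEmpty()) continue; // blank line between forms
                if (form.length() > 0) form.append('\n');
                form.append(line);
                depth += depthDelta(line);
                if (depth <= 0) return form.toString(); // delimiters balanced
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Input ended: return any unbalanced tail, else signal end of input.
        return form.length() > 0 ? form.toString() : null;
    }

    public static void main(String[] args) {
        String src = "(ns demo)\n\n(defn f [x]\n  (inc x))\n(def y 1)\n";
        TopLevelFormReader reader = new TopLevelFormReader(new StringReader(src));
        for (String form = reader.nextForm(); form != null; form = reader.nextForm()) {
            System.out.println("FORM: " + form);
        }
    }
}
```

Each returned string could then be read and evaluated on its own, which keeps any single chunk far below the size of the whole file.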

>> Sensible alternatives are a) get the string to whatever
>> consumes it by some other means than embedding it as a single
>> monolithic constant in bytecode,
>
> This is what we currently do in ABCL (by storing literal objects in a
> thread-local variable and retrieving them later when the compiled code
> is loaded), but it only works for the runtime compiler, not the file
> compiler (in Clojure terms, it won't work with AOT compilation).

Yes, this is the same issue raised in connection with allowing
arbitrary objects in code in eval.

>> b) convert long strings into shorter
>> chunks and emit a static initializer into the bytecode to reassemble
>> them with concatenation into a single runtime-computed string constant
>> stored in another static field,
>
> This is what I'd like to have :)

Frankly it seems like a bit of a hack to me, though since it would be
used to work around a Y2K-style bug in Java it might be poetic justice
of a sort.
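
For what option b) might look like at the string level, here is a small sketch under stated assumptions. It is plain Java rather than emitted bytecode, and ChunkedLiteral, split and join are invented names. It uses the fact that in the class file's modified UTF-8 each Java char takes at most 3 bytes, so any chunk of at most 65535 / 3 = 21845 chars is guaranteed to fit in one constant. A real compiler would emit each chunk as its own constant and do the join in a static initializer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting an oversized literal and reassembling it.
final class ChunkedLiteral {
    // Conservative bound: at most 3 encoded bytes per char, 65535 bytes per constant.
    static final int MAX_CHARS_PER_CHUNK = 0xFFFF / 3; // 21845

    // Splits s into pieces that each fit in a single string constant.
    static List<String> split(String s) {
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < s.length(); start += MAX_CHARS_PER_CHUNK) {
            int end = Math.min(s.length(), start + MAX_CHARS_PER_CHUNK);
            chunks.add(s.substring(start, end));
        }
        return chunks;
    }

    // What the generated static initializer would do: concatenate once, keep the result.
    static String join(List<String> chunks) {
        StringBuilder sb = new StringBuilder();
        for (String chunk : chunks) sb.append(chunk);
        return sb.toString();
    }

    public static void main(String[] args) {
        StringBuilder big = new StringBuilder();
        for (int i = 0; i < 100_000; i++) big.append((char) ('a' + i % 26));
        List<String> chunks = split(big.toString());
        System.out.println(chunks.size() + " chunks, round trip ok: "
                + join(chunks).equals(big.toString()));
    }
}
```

The fixed 21845-char bound wastes space for mostly-ASCII text; a tighter splitter could measure the encoded length of each chunk instead.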

>> and c) restructure whatever consumes
>> the string to consume a seq, java.util.List, or whatever of strings
>> instead and feed it digestible chunks (e.g. a separate string for each
>> defn or other top-level form, in order of appearance in the input file
>> -- surely nobody has *individual defns* exceeding 64KB).
>
> The problem is not in the consumer, but in the form containing the
> string; to do what you're proposing, the reader, upon encountering a
> big enough string, would have to produce a seq/List/whatever instead,
> the compiler would need to be able to dump such an object to a class,
> and all Clojure code handling strings would have to be prepared to
> handle such an object, too. I think it's a little impractical.

I don't think so. The problem isn't with normal strings but only with
strings that get embedded as literals in code; and moreover, the
problem isn't even with those strings exceeding 64k but with whole
.clj files exceeding 64k. The implication is that load-file generates
a class that contains the entire contents of the sourcefile as a
string constant for some reason; so:

a) What does this class do with this string constant? What code consumes it?

b) Can that particular bit of code be rewritten to digest the same
information provided in smaller chunks?

> Regarding the size of individual defns, that's an orthogonal problem;
> anyway, the size of the _bytecode_ for methods is limited to 64KB (see
> <http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html#88659>)
> and, while pretty big, it's not impossible
> to reach it, especially when using complex macros to produce a lot of
> generated code.

Another problem for which we will probably need an eventual fix or
workaround. If bytecode can contain a JMP-like instruction it should
be possible to have the compiler split long generated methods and
chain the pieces together without much loss of runtime efficiency,
particularly if it does so at "natural" places -- existing conditional
branches, particularly, and (loop ...) borders -- (defn foo (if x
(lotta-code-1) (lotta-code-2))) for example can be trivially converted
to (defn foo (if x (lotta-code-1) (jmp bar))) (defn bar
(lotta-code-2)) -- though if you had such a jump instruction I'd have
thought implementing real TCO would have been fairly easy, and
apparently it was not.

Failing such a jmp capability you'd have to just use (bar) in that
last example and suffer an additional method call overhead at the
break-point. Again, the obvious way to do it would be to recognize
common branching construct forms such as (if ...) and (cond ...) that
are larger than the threshold but have individual branches that are
not and turn some or all of the branches into their own under-the-hood
methods and calls to those methods.
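
To illustrate the method-call fallback, a toy sketch in Java: the names SplitAtBranch, fooUnsplit, foo, fooThen and fooElse are all made up, and the tiny branch bodies stand in for the large generated code that would actually force a split. The point is only the shape of the transformation: each branch of the conditional moves into its own method, and the original method keeps just the test and two calls.

```java
// Hypothetical illustration of splitting a method at a conditional branch.
final class SplitAtBranch {
    // Before: both branches live in one method body, whose bytecode
    // could exceed the per-method size limit if they were large.
    static int fooUnsplit(boolean x, int n) {
        if (x) {
            return n * 2 + 1; // stands in for (lotta-code-1)
        } else {
            return n * n - 1; // stands in for (lotta-code-2)
        }
    }

    // After: the method keeps only the test; each branch is its own method,
    // at the cost of one extra call at the break point.
    static int foo(boolean x, int n) {
        return x ? fooThen(n) : fooElse(n);
    }

    private static int fooThen(int n) {
        return n * 2 + 1;
    }

    private static int fooElse(int n) {
        return n * n - 1;
    }

    public static void main(String[] args) {
        System.out.println(foo(true, 5) == fooUnsplit(true, 5));
        System.out.println(foo(false, 5) == fooUnsplit(false, 5));
    }
}
```

Any locals the branches need have to be passed as arguments, which is the main bookkeeping a compiler would have to do here.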

> We used to generate such big methods in ABCL because
> at one point we tried to spell out in the bytecode all the class names
> corresponding to functions in a compiled file, in order to avoid
> reflection when loading the compiled functions. For files with many
> functions (> 1000 iirc) the generated code became too big. It turned
> out that this optimization had a negligible impact on performance, so
> we reverted it.

I wonder if Clojure is using a similar optimization and would benefit
from its reversion.

-- 
Protege: What is this seething mass of parentheses?!
Master: Your father's Lisp REPL. This is the language of a true
hacker. Not as clumsy or random as C++; a language for a more
civilized age.
