[The Java Posse] Re: Java officially lags C

Reinier Zwitserloot Wed, 02 Sep 2009 13:23:09 -0700

Casper: I don't think you grokked my point.

I'm saying it's impossible to build any java, vanilla or otherwise,
that can handle this. For the reasons I stated: You'd have to flip the
architecture upside down and resolve 'DSL' properly midway through
tokenizing it. Be aware that this automatically means that any error
caused by the DSL provider *HAS TO* stop the parsing right there on
the spot, no further error reporting for anything that follows the DSL
block. The trick IDEs use, where a class file is still produced but
every method with a syntax error is replaced by a dummy that throws an
exception, would be impossible.

You'd be giving up an awful lot.

Don't get me wrong, I love the idea, but I haven't seen a workable
proposal yet. I'm leaning towards the notion that it's impossible to
get right. Fan tries to use a sufficiently arcane separator (bar-
angle, so <| special code goes here |>), but if java uses the same
thing, then you can't embed fan in java. That's not a solution.

Here's a simplistic approach to something that might actually work:

1. Identifier resolution is decoupled from the rest of the source file
for parsing. In other words, the parser will parse all import
statements, resolve them, and only then continue on its way.

2. Blocks start with a hash, followed by a type identifier. This type
identifier is resolved only according to import statements; to make
this smooth, the definitions for how to handle these blocks MUST
ALWAYS be top level members, no exceptions. Now the parser does not
have to consider inner classes and such to resolve the name; the
process of checking the current package and all import statements
suffices.

3. The tokenizer will remember the character that followed the token
(i.e. the non-identifier character that immediately followed the last
identifier character in the DSL name, which can be a space, a quote, a
brace, whatever), and push it back into the source view. The
tokenizer then hands the raw source (as a Reader or some such) off to
the .tokenize() method of the provider. That method may return any
object it likes, but it MUST have consumed exactly up to (and
including) the closing element of the DSL block.

4. During compilation, the DSL block (which is an expression which can
have an arbitrary type, including void) is translated into a pure java
expression by calling the .parse() method of the DSL provider.

5. Exceptions during the tokenize phase result in the immediate end of
parsing that java source file, as javac will not know where to
continue. Exceptions during the parse method aren't nearly as drastic;
it just means there's an error in the DSL block and the block's
expression is of an unknown type - certainly not rocket science
compared to the advanced error recovery employed in many IDEs.

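public interface DSLProvider<T> {
    // Consumes the raw DSL block and returns an arbitrary token object.
    public T tokenize(SourceReader reader);
    // Translates that token into a pure java expression.
    public String parse(T token, Context c);
}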

Some open issues: What should 'parse' return? There's an argument
to be made for 'bytecode', 'raw java source as a String', and 'a
JCExpression object' (from javac's internal AST classes). Each has its
advantages and disadvantages.

Context is some useful construct that allows access to variables legal
in the current scope, the filer (for looking up types), and similar
things. A lot of this API already exists (annotation processor API).
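
To make steps 1 to 3 a bit more concrete, here is a minimal sketch of
the hand-off as seen from the host tokenizer. Everything in it is an
assumption for illustration: java.io.PushbackReader stands in for
SourceReader, BlockTokenizer is a cut-down version of the interface
above with only the tokenize half, and the HostTokenizer class with its
providers map (a stand-in for real import resolution) is hypothetical.
It is not how javac is built.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, cut-down provider: only the tokenize half, over a plain Reader.
interface BlockTokenizer<T> {
    T tokenize(Reader source) throws IOException;
}

// Hypothetical host-side hand-off; a sketch, not javac's real tokenizer.
class HostTokenizer {
    // Stands in for steps 1 and 2: names already resolved from the imports.
    private final Map<String, BlockTokenizer<?>> providers = new HashMap<>();

    void register(String name, BlockTokenizer<?> provider) {
        providers.put(name, provider);
    }

    // Called with the reader positioned just after the '#'.
    Object readBlock(PushbackReader in) throws IOException {
        StringBuilder name = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && Character.isJavaIdentifierPart(c)) {
            name.append((char) c);
        }
        // Step 3: the character that ended the name goes back into the
        // source, so the provider sees it as the first character.
        if (c != -1) {
            in.unread(c);
        }
        BlockTokenizer<?> provider = providers.get(name.toString());
        if (provider == null) {
            throw new IllegalStateException("no DSL provider for #" + name);
        }
        // From here on the provider decides where the block ends.
        return provider.tokenize(in);
    }
}
```

The host never looks inside the block; it only learns where the block
ended from how far the provider has read.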

Such a system could rather easily support a wide variety of stuff you
may wish to inject into java source files:

 - String literals
 - Regexp literals - the compiled regexp tree would be stored into the
class file.
 - XML literals
 - multiline and/or raw string literals.
 - python - even including python's whitespace based delimiting as the
mechanism to delimit the block ITSELF, if you think that is a good
idea.
 - Clojure, LISP, and other lisp dialects.
 - just about every programming language in existence (incl. ruby,
Javascript, C, C#, C++, fortran, ada, and, sure, why not - APL).
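
As a rough illustration of the python case, where indentation itself
delimits the block, a tokenizer could look something like the sketch
below. It assumes the source is available as a java.io.BufferedReader
(so one character can be peeked with mark/reset); the class name
IndentBlockTokenizer is made up, and treating a blank line as the end
of the block is a simplification.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Hypothetical tokenizer for an indentation-delimited block such as "#python:".
class IndentBlockTokenizer {
    // The reader is positioned right after the DSL name.
    String tokenize(BufferedReader in) throws IOException {
        in.readLine(); // rest of the "#python:" line, normally just ":"
        StringBuilder body = new StringBuilder();
        while (true) {
            in.mark(1);
            int first = in.read();
            if (first != ' ' && first != '\t') {
                // The first column is not indented: this is host code again,
                // so give the character back and stop.
                if (first != -1) {
                    in.reset();
                }
                break;
            }
            String rest = in.readLine();
            body.append((char) first).append(rest == null ? "" : rest).append('\n');
        }
        return body.toString();
    }
}
```

The block ends at the newline of the last indented line, and the
following unindented line is left untouched for the host tokenizer.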

The documentation should stress that the .tokenize() method really
should try its very best to return and not throw an exception.
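
One way a provider could follow that advice is to record the problem
inside the token and still find the end of the block, leaving the
actual error report to the parse phase. The sketch below does this for
a /pattern/flags literal. RegexToken and RegexTokenizer are
hypothetical names, java.io.PushbackReader is assumed in place of
SourceReader, and the flag set "imsx" is only an example. A '/' inside
a character class is not handled.

```java
import java.io.IOException;
import java.io.PushbackReader;

// Hypothetical token: the pieces of the literal plus an optional problem.
final class RegexToken {
    final String pattern;
    final String flags;
    final String error; // null when the block looked fine

    RegexToken(String pattern, String flags, String error) {
        this.pattern = pattern;
        this.flags = flags;
        this.error = error;
    }
}

// Hypothetical tokenizer that reports problems in the token, not by throwing.
class RegexTokenizer {
    RegexToken tokenize(PushbackReader in) throws IOException {
        int c = in.read();
        while (c != -1 && Character.isWhitespace(c)) {
            c = in.read();
        }
        if (c != '/') {
            // No block to delimit; consume nothing more and say so.
            if (c != -1) {
                in.unread(c);
            }
            return new RegexToken("", "", "expected '/' to open the regexp");
        }
        String error = null;
        StringBuilder pattern = new StringBuilder();
        boolean escaped = false;
        boolean closed = false;
        while ((c = in.read()) != -1) {
            if (!escaped && c == '/') {
                closed = true;
                break;
            }
            escaped = !escaped && c == '\\';
            pattern.append((char) c);
        }
        if (!closed) {
            error = "unterminated regexp";
        }
        // Flags are the letters right after the closing slash. An unknown
        // flag is an error, but the block still ends in the same place.
        StringBuilder flags = new StringBuilder();
        while ((c = in.read()) != -1 && Character.isLetter(c)) {
            flags.append((char) c);
            if ("imsx".indexOf(c) < 0 && error == null) {
                error = "unknown flag '" + (char) c + "'";
            }
        }
        if (c != -1) {
            in.unread(c);
        }
        return new RegexToken(pattern.toString(), flags.toString(), error);
    }
}
```

With a bad flag, the host can still carry on right after the block and
report the remaining errors in the file.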

hypothetical source:

int x = #python:
    5 + 5
int thisIsJavaAgain;

String longText = #long """This is a long string where \backslashes need
not be escaped""" + "this is parsed by javac again";
Pattern p = #regexp /[abc]d\s+(\d*)/i;
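
For the #long case, a provider might look roughly like the sketch
below, assuming 'parse' returns raw java source as a String (one of the
open options above). java.io.Reader is assumed in place of
SourceReader, Context is an empty placeholder, and LongStringProvider
is a made-up name. For brevity this version throws on malformed input.

```java
import java.io.IOException;
import java.io.Reader;

// Placeholder for the post's Context; its real shape is an open question.
interface Context {}

// The interface from the post, with java.io.Reader assumed for SourceReader.
interface DSLProvider<T> {
    T tokenize(Reader reader) throws IOException;
    String parse(T token, Context c);
}

// Hypothetical provider for: #long """raw text"""
class LongStringProvider implements DSLProvider<String> {
    private static final String DELIM = "\"\"\"";

    @Override
    public String tokenize(Reader reader) throws IOException {
        StringBuilder buf = new StringBuilder();
        boolean opened = false;
        int c;
        while ((c = reader.read()) != -1) {
            if (!opened && buf.length() == 0 && Character.isWhitespace(c)) {
                continue; // whitespace between the DSL name and the delimiter
            }
            buf.append((char) c);
            if (!opened) {
                if (!DELIM.startsWith(buf.toString())) {
                    throw new IOException("expected opening " + DELIM);
                }
                if (buf.length() == DELIM.length()) {
                    opened = true;
                    buf.setLength(0);
                }
            } else if (buf.length() >= 3 && buf.indexOf(DELIM, buf.length() - 3) >= 0) {
                // Consumed up to and including the closing delimiter.
                return buf.substring(0, buf.length() - 3);
            }
        }
        throw new IOException("unterminated " + DELIM + " block");
    }

    @Override
    public String parse(String token, Context c) {
        // Emit an ordinary java string literal with the escapes filled in.
        StringBuilder out = new StringBuilder("\"");
        for (char ch : token.toCharArray()) {
            switch (ch) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                default:   out.append(ch);
            }
        }
        return out.append('"').toString();
    }
}
```

The tokenize half only has to find the closing delimiter; all the
escaping work is deferred to parse, which runs later and can fail
without stopping the rest of the file from being read.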

Presuming that the context object is sufficiently advanced, this
should also be possible, especially if you add a way to parse a java
snippet in that context:

private final Comparator<Integer> absoluteComparator = #closure
Comparator(Integer a, Integer b) { return Integer.compare(Math.abs(a),
Math.abs(b)); };

Of course, trying to include java inside such a block has the same
issue as javac's original problem: How does the closure DSL provider
know where the closure ends without being as complicated as javac's
tokenizer? Theoretically java itself could be implemented with this
scheme, and you could then start the snippet parser at the 'return'
statement, getting a tokenized object back, which, during your parse
phase, you can get parsed by calling on javac's own parse method.

The central point is this: You have to split tokenizing and parsing.
This is yet another instance where fan tries to take the easy way out.

On Sep 2, 7:39 pm, Casper Bang <casper.b...@gmail.com> wrote:
> > tell me how the compiler could possibly sort this out? The only way is
> > for the compiler to hand off the entire process of TOKENIZING this
> > stream to the DSL provider for 'longString', which is an entirely
> > different architecture - right now all java parsers do the fairly
> > usual thing of tokenizing the whole deal, then tree-izing the whole
> > thing, and only then starting the process of resolving 'DSL' into
> > "java.lang.DSL" or whatever you had in mind.
>
> Oh sure, I should have mentioned explicitly how this obviously won't
> work with a vanilla javac. Anyway here's the original post I was
> referring to: http://www.jroller.com/scolebourne/entry/enhancing_java_multi_lingual...
>
> > You'd have to create very specific rules about how the compiler can
> > find the end of the DSL string. I've thought about this and have not
> > been able to come up with a particularly sensible rule. The only one I
> > can think of is to stick to C-esque rules: strings are things in
> > double or single quotes, and use backslash internally for escapes, and
> > braces are supposed to be matched. However, these restrictions already
> > remove most other languages: You can't put python in there (multi-line
> > strings will screw up java's parser), you can't put regular
> > expressions in there (no rule enforcing matched quotes or braces). You
> > can't put XML in there (no rule enforcing matched braces or quotes).
> > No go.
>
> Well it's not a trivial issue, no, but this is how it works in
> Fan: http://fandev.org/sidewalk/topic/438
>
> /Casper