Earlier I wrote:
> ...eText to XML/XHTML/HTML and not bother to inflict markup on people.
Instead, I'll use white space intelligently along with Rebol embedded in the
script (in a very nice way), to generate web pages.
Here is the eText specification that Garold and I have been working on.
Naturally, the document itself is in the eText format.
We're discussing eText on this list: [EMAIL PROTECTED] Subscribe to
the list: [EMAIL PROTECTED]
Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/
-><-
-- Attached file included as plaintext by Listar --
-- File: eText_am_02_glj.txt
eText
Note: The above line is the title of the HTML document and the text in the H1 tag for
the HTML.
Author: Andrew Martin
eMail: [EMAIL PROTECTED]
Date: 16/November/2000
Site: http://members.nbci.com/AndrewMartin/
Comment: Did you know that writing the contents of a Rebol header is now quite natural
(at least to me)? First word followed by a colon is the trigger for META data in the
eText document. The data for the item continues on until the end of the line. The
pattern could be continued...
GLJ -- I use the word / phrase ':' construct routinely from long before Rebol. I think
we should retain it, but I am not certain yet just how best to do that.
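As an illustration only (Python here, not Rebol), the "first word followed by a colon" trigger could be sketched like this. The names `META_LINE` and `parse_meta` are made up for the example, and the rule that the metadata block ends at the first non-matching line is an assumption, not something settled above.

```python
import re

# Assumption: a metadata line is one word, a colon, then a value that
# runs to the end of the line.
META_LINE = re.compile(r"^(\w+):\s*(.*)$")

def parse_meta(lines):
    """Collect leading 'Word: value' lines, stopping at the first line that does not match."""
    meta = {}
    for line in lines:
        match = META_LINE.match(line)
        if match is None:
            break
        meta[match.group(1)] = match.group(2).strip()
    return meta

header = [
    "Author: Andrew Martin",
    "Date: 16/November/2000",
    "Site: http://members.nbci.com/AndrewMartin/",
    "",
    "This is a header",
]
meta = parse_meta(header)
```

Note that an ordinary line such as "Note: ..." also fits this pattern, which is one reason the exact rule still needs discussion.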
This is a header
Why is the above line a header (H2 in HTML)? Because it's separated from the text
above and below by 2 blank lines. It's also short, less than 40 odd characters, so it
must be a header. It also _doesn't_ have a terminating period or full stop at the end or
some other sentence terminator.
GLJ -- I see that we are working from slightly different perspectives. I was not
working on guessing at the intended or incidental structure of totally free format
text. Rather I was looking at text that was intended to have a structure but was
prepared in ASCII without access to further formatting. That is why I don't object to
marking such things as indent levels. I also tend to think in outlines naturally, and
to compose that way.
This is a subsequent paragraph. This paragraph will have the first line indented on
the HTML nice looking version. The above paragraph is an initial paragraph, a
paragraph that shouldn't have the first line indented. Note that I'm letting my text
editor wrap lines appropriately as I can't be bothered doing it for my tools, I'd much
rather let my tools wrap my text for me. Wouldn't you?
I can tell that the above text are paragraphs, because they have a full stop at the
end, are long, and have only one blank line between them. They also have multiple
sentences in them, with a period (or other similar terminator, like "-", ":", ";", "?"
or "!") at the end of each sentence.
GLJ -- I work in a programmer's editor which word wraps automatically rather than
something like Notepad or WordPad where the wrapping is only visual. I definitely *do*
want the text re-wrapped as needed. HTML will wrap it anyway. This implies to me that
blank lines separate paragraphs. This runs me into problems with list items which I
tend not to separate.
Double Quotes
Surrounding text in double quotes should leave the text unchanged. This effect stops
at a newline or end of line, just in case they're not balanced? We might need to think
more about this.
GLJ -- No-Tags treats any item that would normally expect a balancing item as being a
single character rather than markup if there is no balancing character within the
paragraph. That is that effects are limited to paragraphs. This needs some work.
Short single lines of text between paragraphs, with one blank line before and after
and no sentence terminator, should be a H3 heading or sub-heading. Perhaps at most 40
characters in length.
GLJ -- How do you tell H2 from H3 from ...? I find a need for a title and at least 3
levels of heading. Since HTML supports 6 levels I saw no reason not to do so as well.
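The heading guesses described so far (short line, no sentence terminator, two blank lines before for H2, one for H3) can be written down as a small classifier. This is a rough Python sketch under those stated rules only; `classify`, `TERMINATORS` and `MAX_HEADING_LEN` are illustrative names, and it does not attempt to answer the question of further heading levels.

```python
TERMINATORS = ".-:;?!"   # sentence terminators named in the text
MAX_HEADING_LEN = 40     # "less than 40 odd characters"

def classify(block, blank_lines_before):
    """Return 'h2', 'h3' or 'p' for one block of text."""
    text = block.strip()
    single_line = "\n" not in text
    looks_like_heading = (
        single_line
        and len(text) < MAX_HEADING_LEN
        and text[-1:] not in TERMINATORS
    )
    if not looks_like_heading:
        return "p"
    # Assumption: two or more blank lines above mean H2, otherwise H3.
    return "h2" if blank_lines_before >= 2 else "h3"
```

For example, `classify("This is a header", 2)` gives `"h2"`, while the same line with a single blank line above it gives `"h3"`.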
If I insist on writing one sentence long paragraphs, going on and on, droning about
nothing at all, until you are tired reading this text, it will come out all inside a
H2 paragraph tag, and will be very obvious to all, and so should be very embarrassing
at immediate glance, that the full stop or period is missing at the end of this
sentence
If I try to trick the interpreter
The above line could be considered a sub-header. Or it might be a sentence fragment.
It's short enough to be considered a sub-header, so even if the interpreter is wrong
and makes it a sub-header using H3 HTML tag, it still makes the error obvious to human
eyes.
----
The above should be one section of text
The above line should be a header, H2 in HTML. I totally agree with your statement:
"The Basic Idea of eText is to allow documents to be created in relatively plain text
that is still human readable." I'd also like to add that eText should be easy to
create and modify for unsophisticated users, who only know how to type, and more
computer professionals, who may be a bit tired and want something that's obvious.
eText also *shouldn't* require manual line wrapping. I'm using a Windows Notepad
replacement to generate this text, the standard Windows Notepad should be able to cope
as well. Just turn "Word Wrap" on.
GLJ -- I agree that eText should be suitable for unsophisticated users. I tend to
think that I want a smooth transition to fairly complete control. I would like my
plain text to allow me to go quite a ways before I need to move to better tools. I
would even consider marking the document for level of formality from no markup to
intermediate markup to full markup. I have no quarrel with the translator guessing at
this. For example, if I use '#' to mark headers it should feel free to assume that all
headers are marked, etc. I find that programs that get too smart can get very hard to
outsmart. I at least want a way to override the program's guesses. That would allow me
to feed plain text in and then go "fix up" the places where it didn't guess right. I
will discuss the purposes of eText later.
I've improved on my ideas for recognising headers by closely examining your text in
markup.mtx and noticed how your level 3 headings had a blank line before and after.
Going backwards to two blank lines to separate level 2 headings is consistent with
Project Gutenberg eText, and seems fairly obvious.
Character-wise markup
For character-wise markup, I'd like to use the following (basically as you suggest):
"*bold*" - Asterisk text asterisk is bold or strong emphasis.
"_underline_" - Underline text underline is underline or mild emphasis.
"~italic~" - Tilde text tilde is italic.
"=fixed=" - Equals text equals is fixed width.
"--" - Two dash inside text is en-dash.
"---" - Three dash inside text is em-dash.
"----" - Four dash or more in the left column is a horizontal rule.
"====" - Four equals or more in the left column is a bigger break, perhaps using the
DIV tag in HTML. I'm not sure what it should be yet.
"^text" - superscripts the following text until the first white space.
"^^text" - subscripts the following text?
The above text should end up as a HTML table of definitions. The pattern should be:
Text [TAB | SPACE] "-" [TAB | SPACE] text
The above line (effectively code or script) should be blockquoted, preformatted and in
typewriter font. The pattern is two tabs in from the left margin. The text should be
unchanged, except for HTML tags, which should be translated so as to be visible in a
browser. In other words, "<" should be "&lt;", ">" should be "&gt;" and "&" should be
"&amp;".
GLJ -- While I went back and changed this to a table, I wouldn't have considered doing
it automatically. I often do such things as this with responses -- or other material
-- set off by dashes. If all text were prepared with paragraphs recognized by newline,
then perhaps the fact that there were multiple lines could signal a table?
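To make the character-wise markup in this section concrete, here is a hedged Python sketch that escapes HTML first and then applies the inline rules line by line. The choice of output tags (`<b>`, `<u>`, `<i>`, `<tt>`) and the names `INLINE_RULES` and `render_inline` are assumptions for illustration; unbalanced markers are simply left alone.

```python
import html
import re

# (pattern, replacement) pairs. Order matters: "^^" before "^",
# and three dashes before two.
INLINE_RULES = [
    (re.compile(r"\^\^(\S+)"), r"<sub>\1</sub>"),
    (re.compile(r"\^(\S+)"), r"<sup>\1</sup>"),
    (re.compile(r"\*(\S(?:.*?\S)?)\*"), r"<b>\1</b>"),
    (re.compile(r"_(\S(?:.*?\S)?)_"), r"<u>\1</u>"),
    (re.compile(r"~(\S(?:.*?\S)?)~"), r"<i>\1</i>"),
    (re.compile(r"=(\S(?:.*?\S)?)="), r"<tt>\1</tt>"),
    (re.compile(r"(?<!-)---(?!-)"), "&mdash;"),
    (re.compile(r"(?<!-)--(?!-)"), "&ndash;"),
]

def render_inline(text):
    """Escape &, < and >, then apply the inline markup rules."""
    out = html.escape(text, quote=False)
    for pattern, replacement in INLINE_RULES:
        out = pattern.sub(replacement, out)
    return out
```

Because "." in these patterns does not cross a newline, each effect stops at the end of the line, which is one way to contain unbalanced markers.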
Embedding HTML
Do we really need to? A graphically intensive page would be better designed using GIF,
JPG, Flash and Style Sheets, and should have a computer graphics artist working on the
project. There would be very little human-readable text in the page, I feel. Still, if
we're careful, it could be included?
GLJ -- The only reasons for considering embedded HTML are: 1) Twiki allowed it and was
one of the systems I was considering, 2) It is currently the only approach I have to
font size and color, and 3) The notation doesn't rule it out. I don't care for
embedded HTML, but more and more people are allowing it and using it. Also formatting
within table cells is easier with some of the tags. I object to HTML as being the only
way to do things which some systems insist on.
Embedding Rebol script
Rebol Server Pages (RSP)
I think the ASP "standard", that I adapted for Rebol, should serve:
"<%!" - Directives or meta-information.
"<%" - Rebol code that is not intended to return a value.
"<%:" - Rebol code that returns a value.
"%>" - Terminate any of the above opening tags.
The ASP "standard" uses "@" for directives and "=" for expressions that return a value.
GLJ -- I don't have a quarrel with any procedure that works here. I think that
something that looks familiar to anyone who has embedded other languages in HTML is
acceptable. I don't think there is any reason for introducing anything really foreign
looking. The suggested use of braces came from MTX and Latte which found that they
provided minimal interference.
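For illustration, the three opening tags and the shared closing tag could be recognised with a tokenizer along these lines. This is a Python sketch, not the RSP implementation; `tokenize`, `RSP_TAG` and `KINDS` are hypothetical names, and the sketch assumes "%>" never occurs inside the embedded code itself.

```python
import re

# "<%!" directive, "<%:" expression, plain "<%" code; all end with "%>".
RSP_TAG = re.compile(r"<%([!:]?)(.*?)%>", re.DOTALL)
KINDS = {"!": "directive", ":": "expression", "": "code"}

def tokenize(page):
    """Split a page into (kind, body) pairs, where kind is
    'text', 'directive', 'code' or 'expression'."""
    tokens, pos = [], 0
    for match in RSP_TAG.finditer(page):
        if match.start() > pos:
            tokens.append(("text", page[pos:match.start()]))
        tokens.append((KINDS[match.group(1)], match.group(2).strip()))
        pos = match.end()
    if pos < len(page):
        tokens.append(("text", page[pos:]))
    return tokens
```

A later pass would evaluate the "code" and "expression" bodies and splice expression results back into the text.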
Embedded Rebol
For simple embedding of Rebol values in the text, I'd suggest using ":" or colon to
mark the start of rebol word to "get". For example, the date now is :now. The ":now"
will be replaced with the rebol value for now, 17/Oct/2010 or whatever the time/date
is. If the value is a file!, then the value gets substituted appropriately into the
resulting HTML file.
GLJ -- Ok -- provisionally. I think that trying to guess too cleverly will result in
surprises. I prefer that all programs adhere to the WYGINS (What You Get Is No
Surprise) principle.
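A cautious version of the ":word" substitution, in the spirit of "no surprises", might only replace words it actually knows and leave everything else untouched. The Python sketch below uses a plain dictionary as a stand-in for looking up a Rebol word; `substitute` and `GET_WORD` are illustrative names.

```python
import re

# Assumption: a "get" reference is a colon that starts a token (not preceded
# by a word character, colon or slash), followed by a word.
GET_WORD = re.compile(r"(?<![\w:/]):([A-Za-z][\w-]*)")

def substitute(text, values):
    """Replace :word with values[word]; unknown words are left as written."""
    def replace(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)
    return GET_WORD.sub(replace, text)
```

With `{"now": "16/Nov/2000"}`, the sentence "The date now is :now." becomes "The date now is 16/Nov/2000.", while URLs and "Label: text" constructs are not touched because their colons follow a word character.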
Links
Here's a link to my site: http://members.nbci.com/AndrewMartin/. My email address:
[EMAIL PROTECTED] Here's some boiler plate text to click on: file:MyBoilerPlate.txt.
Note that Rebol and Oscar can recognise url! and email! datatypes. I think a sentence,
an initial capital letter terminated with a ":", followed by a url! or email! (or
"text.txt" for file!) and followed by a period, should be sufficient grounds to be
considered a link.
So the above text should look like this in HTML:
<a href="http://members.nbci.com/AndrewMartin/">Here's a link to my
site</a>.
and:
<a href="mailto:[EMAIL PROTECTED]">My email address</a>.
and:
<a href="MyBoilerPlate.txt">Here's some boiler plate text to click
on</a>.
Note that the ":" has been removed, and the "." is after the link. I've found that
having the period outside the link is more pleasing to my eye at least.
GLJ -- The URL and the email address are fine. Most browsers and modern mail clients
will do this. I am wary of the file notation as there isn't any standard practice that
I can tie it to.
GLJ -- "Here's some boiler plate text to click on: file:MyBoilerPlate.txt. " I think
that the "file:" sneaked in there. "Here's some boiler plate text to click on:
MyBoilerPlate.txt." matches the pattern you describe [ Initial cap words: <something
linkable>. ] works for me. It is not so strange that I couldn't adjust.
A picture could be recognised by the above pattern, then checked for .gif or .jpg
extensions and substituting a "<img>" tag instead of the "<a>" tag. The text before
the colon substitutes for the ALT text.
GLJ -- Once we recognize that we have a link, trying to interpret the extension is
perfectly reasonable.
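The link pattern discussed above (capitalised label, colon, something linkable, period) can be sketched as a single regular expression. This is a Python illustration only: `LINK` and `linkify` are made-up names, the recognition of URLs, email addresses and file names is deliberately crude, and the address `me@example.com` in the usage note is a placeholder.

```python
import html
import re

LINK = re.compile(
    r"(?P<label>[A-Z][^.:!?]*):\s+"
    r"(?P<target>(?:https?://|ftp://)\S+?"   # url!
    r"|[\w.+-]+@[\w.-]+\.\w+"                # email!
    r"|[\w-]+\.\w+)"                         # file name with an extension
    r"\.(?=\s|$)"                            # the period stays outside the link
)
IMAGE_EXTENSIONS = (".gif", ".jpg")

def linkify(text):
    """Turn 'Label: target.' into an <a> (or <img> for .gif/.jpg) element."""
    def replace(match):
        label = match.group("label")
        target = match.group("target")
        is_email = "@" in target and "://" not in target
        href = html.escape(("mailto:" if is_email else "") + target, quote=True)
        if target.lower().endswith(IMAGE_EXTENSIONS):
            return f'<img src="{href}" alt="{html.escape(label, quote=True)}">.'
        return f'<a href="{href}">{html.escape(label, quote=False)}</a>.'
    return LINK.sub(replace, text)
```

For example, `linkify("My email address: me@example.com.")` yields `<a href="mailto:me@example.com">My email address</a>.`, with the colon removed and the period after the link.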
When creating the page interactively, like in a Wiki or Sparrow, when a link to a
local file is created, the system will create a "blank" page for the link to go to (if
there's not already a page of that name). The system should also be free to modify the
case of file name in the link to agree with pre-existing local files. It should also
substitute "%20" for spaces in filenames, much like MS Internet Explorer does, and
Rebol does with URLs.
GLJ -- I think I like the Wiki idea of a '?' that is a link to create a non-existent
local link. It is easy to get used to and immediately points out spelling mistakes
when presented.
Wiki Word Links
I dislike the WikiWordLinks. For example a link to Wiki requires the word to be
written as "WIki", which is ~unnatural~ to me.
GLJ -- It depends on what you are doing. If you are working in a Wiki where most of
the point is linking to other material, I think WikiWords make lots of sense. The fact
that there is a surprise in other text is unfortunate, but the alternative is a
requirement to format every embedded link specially which violates our premise of
natural ease of use. Wiki suffers from lack of any easy way to mark non-WikiWord
links. The WikiWord tends to be natural to those who came from programming
environments such as Pascal. Also, Wiki is not intended to translate text that wasn't
created for it.
Including files
:"My Boiler Plate Text File.txt"
The above line is a command to include the contents of the %"My Boiler Plate Text
File.txt" straight into this text when converted to HTML. The colon can be read as
"get (or evaluate) the thing to the right". The complete contents of the file are
substituted for the line (including the newline at the end).
GLJ -- This is starting to get dicey. This construct doesn't suggest to me any special
treatment. It is available for use from our syntax rules, but I don't care for it. I
suggest that at the level of file inclusion and variables we consider that we have
entered the realm of embedded Rebol. I would prefer to see this as <%: %"My Boiler
Plate Text File.txt" %> This is embedded Rebol returning a value which is the content
of a file. I don't even object to using specific directives for this.
I think that "%*.txt" files should be executed in this page as if their contents had
been written in manually. I'm not sure what should happen with other file extensions.
Perhaps "%*.html" and "%*.htm" files should be inserted *after* the page gets
translated into HTML?
GLJ -- I agree that "%*.txt" files should get included at the time of construction. If
we allow embedded HTML, there is no problem with including them at the time we expand
the file. Rebol Server Pages presumably have a ".rsp" extension and will get included
when seen also.
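As a sketch of the include step for "%*.txt" files, the Python below expands :"file name" lines recursively at construction time. To keep it self-contained it reads from a dictionary of file contents instead of the disk; `expand`, `INCLUDE` and the circular-include check are assumptions added for the example.

```python
import re

# A line consisting only of :"file name" is treated as an include command.
INCLUDE = re.compile(r'^:"(?P<name>[^"]+)"\s*$')

def expand(text, files, seen=()):
    """Replace include lines with the (recursively expanded) file contents."""
    out = []
    for line in text.splitlines():
        match = INCLUDE.match(line)
        if match is None:
            out.append(line)
            continue
        name = match.group("name")
        if name in seen:
            raise ValueError(f"circular include: {name}")
        if name not in files:
            out.append(line)   # unknown file: leave the line as written
            continue
        out.append(expand(files[name], files, seen + (name,)))
    return "\n".join(out)

files = {"My Boiler Plate Text File.txt": "Boiler plate line one.\nLine two."}
page = expand('Intro.\n:"My Boiler Plate Text File.txt"\nOutro.', files)
```

Files that should be inserted only after translation to HTML (the "%*.html" case) would need a separate, later pass.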
Lists
I like just simply using an asterisk in the left margin for unordered lists:
* My first list item.
* Another item.
I think that's reasonably sensible. An ordered list just substitutes "I" for Roman
(followed by a tab), "0" or zero for Arabic, "A" or "a" for capital or lower case
letters. For nested lists, we could try simply using one or more tabs (or several
spaces) before the list item. Like:
* Unordered list item.
	0 First item
	0 Second item
	0 Third
* Option
	I Roman list item
		A Arabic list item
		A Another
	I More romans
Naturally, the system numbers list items sequentially, and doesn't care when we
shuffle the order of the above, and will always number the items correctly. An
optional few characters after the list item "key", like ")./" should be easy to add to
the parse rules for list items.
An improvement on the above would be to allow 0 - 9 for Arabic, correctly written
roman numerals for Roman, A - Z and a - z for Letters.
GLJ -- It seems to me that bulleted lists are unordered and ordered lists have values.
I wouldn't format an ordered list without item "numbers". Output should create the
numbers in sequence, as they will automatically if we use lists in HTML. Most of the
systems require whitespace to indent a list item and measure whitespace to determine
sub lists, but there is no real reason that the top-level list needs to be indented. I
think determining sublists from indentation makes sense. I suggest something like: 1)
The first list element determines the type of the (sub)list -- 'A' - Upper Alpha, 'a'
- Lower Alpha, 'I' - Upper Roman, 'i' - Lower Roman, '0-9' - Numbered, '*' | '-' |
'o' (others?) - bulleted (unordered) list. I don't know about the support for
numbering styles in all browsers. The intent is that if it looks like lists it should
create lists. I think that implies allowing the "number" to be followed by (at least)
')' or '.'. I suggest that any of the styles that Word
or a similar program will number automatically should be allowed -- well, maybe not.
I just looked and Word tries to manage things that I wouldn't. Under the WYGINS
principle, I don't think we should try to support anything we can't render into
reasonable HTML. I know that lists can be done as tables because I have seen it done.
I consider that extreme. We are dealing with simple ideas and the result should be
relatively simple HTML when rendered.
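One way to read the list proposals above is that each item line yields a nesting depth (from indentation), a list style (from its key) and the item text. The Python sketch below assumes one tab, or four spaces, per nesting level and maps keys to CSS-style names; `parse_item`, `ITEM` and `STYLES` are illustrative, not agreed notation.

```python
import re

ITEM = re.compile(
    r"^(?P<indent>[ \t]*)"
    r"(?P<key>[*\-o]|\d+|[IiAa])"   # bullet, number, or a style letter
    r"[).]?[ \t]+"                  # optional ")" or "." after the key
    r"(?P<text>\S.*)$"
)
STYLES = {"I": "upper-roman", "i": "lower-roman",
          "A": "upper-alpha", "a": "lower-alpha"}

def parse_item(line):
    """Return (depth, style, text) for a list item line, or None."""
    match = ITEM.match(line)
    if match is None:
        return None
    key = match.group("key")
    if key in ("*", "-", "o"):
        style = "bullet"
    elif key.isdigit():
        style = "decimal"
    else:
        style = STYLES[key]
    # Assumption: four spaces count the same as one tab.
    indent = match.group("indent").replace("    ", "\t")
    return indent.count("\t"), style, match.group("text")
```

The guess is easy to fool: an ordinary sentence beginning "I like ..." or "A table ..." also looks like a list item here, so some override is still needed.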
Tables
"By default, tables will have borders, blanks for empty cells, and resize to cover 95%
of the window side to side." I disagree with "borders". I think white space is better,
along with careful alignment of contents. Otherwise I'm OK with this.
GLJ -- This is the MTX convention. I have no quarrel with any defaults if they can be
overridden. Possibly white space for tables that we guess at and borders if the tables
are created using vertical bars? I still don't know how to manage multi line text
within a cell. It is unusual but not really rare. When such a table is rendered in
text it is usually shown with horizontal cell boundaries (|----|----|) as well as
vertical ones. Since any such complex tables will require specialized formatting, I
suggest we leave that for later and probably then deal with it as embedded HTML or
something similar.
Processing the eText file
1 Parse the text, processing any include files to form one big file. Nest as
appropriate.
2 Process the one big file, executing the Rebol script embedded in the file and
substituting into the text any results as indicated. The page header object generated
from the meta information at the top should be generated first in this step.
GLJ -- We may decide that the guessing pass needs to be separate? I don't know how
much context that will require but it may well be more than is convenient to handle
along with a formatting pass.
3 Generate the HTML code as appropriate.
GLJ -- No problem with that. Generally the files will be relatively small so there
isn't any problem there. Even if we have to write an intermediate file we certainly
can. There are optimizations that we can perform if we need to later.
GLJ:
Let's revisit the uses and purposes of eText and see if we can get anything from it. I
see a gradient scale of input:
1) Accept really plain text that was never intended to be formatted as other than
plain text and turn it into something that has reasonable formatting. This requires
guessing at the intent from the structure because there were never any real standards
or conventions used in it. The Project Gutenberg texts are the primary example.
2) Format the sort of text that is used in emails and ASCII intended to convey a
little information. This requires at least *bold* support, lists, some way of dealing
with block quoted text such as seen in email, and Auto linking for URLs and email
addresses. Preformatted blocks for code is also needed.
3) Provide for simple text entry that is intended to generate web pages as in Wiki.
Since this is intended to be done "on the fly" it doesn't add much, perhaps italics,
specific heading levels, and simple tables.
4) Text documents that are intended to be formatted. These can be generated directly
or possibly as program output intended for later formatting. This user may be
unsophisticated in terms of formatting languages but is likely quite sophisticated or
at least deliberate about the formatting that is intended. This would support most of
the intentional forms of formatting and some directive type markup. Such features as
the MTX table formatting come to mind. Section and chapter breaks, specified heading
levels, block quotes, pull quotes, etc. are all candidates.
5) Text that is intended to be templates for generating complex web sets. At this
point we have a specification language whose intent is to generate web pages. The
assumption of unsophisticated users and casual text entry gives way to the idea of
control that scales up well. Embedded Rebol is used and needed. Embedded HTML or
advanced directives make sense. Certainly some form of variables and limited
programming make sense.
Are there any steps in the gradient that I have missed?
Given this, you started at item 1 and I started around item 2 or 3, and that creates
some difference in viewpoint.
I can see some flags to the program to control the level of guessing versus rules
applied. Even with type 1 items, I guess I see the objective as guessing as well as
possible and then generating a more advanced form of the text or adding enough
corrective markup to the text to correct places where the program guessed incorrectly
rather than always rendering the same plain text repeatedly.
Using this gradient can we define a scalable set of rules that supports all the levels
or do we have to have more than 1 system to cover the range?
One area of difficulty is the determination of paragraphs. While WordPad and such
programs wrap text only visually, many editors, including programming editors and email
editors physically wrap the text using newlines. This gives rise to the convention of
delimiting paragraphs with blank lines. Except in tables and lists where lines are
considered separate. Since the wrapped text is the more difficult and handles the
unwrapped text also, I think we need to deal with it.
I find that I want to use multiple lines in list elements and in some types of tables.
I find that I assume that table and list items (which are individually marked) assume
that each item or row constitutes a new paragraph. I will start a list (usually
indented) after a paragraph without inserting a separating blank line.
When I generate a list that looks more like a table I usually intend to have a line on
the left be in the same cell as the multiple lines on the right. e.g.:
-----
Twiki    Used for editing web pages using a browser.
         Supports automatic links of "wiki words" with multiple caps
         in the word. Add '?' if not in index, create from '?'
         link.
         Uses %...% for variables which are defined externally and
         are multi-level.
         Allows HTML tags to be embedded.
         Generates full navigation bar, editing, file attachments.
         Category mechanism is very interesting. Compare to jrju.
No-Tags  A plain text markup system.
-----
I would like our system to do something reasonable with this or to have a way to make
it do something reasonable without undue hardship.
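One possible "something reasonable" for such a list is sketched below in Python. It assumes a row starts at the left margin with the label separated from its text by two or more spaces (or a tab), and that indented lines continue the right-hand cell of the current row; `parse_rows` is a hypothetical helper and these layout rules are assumptions, not an agreed convention.

```python
import re

def parse_rows(lines):
    """Return (label, text) pairs from a two-column plain-text list."""
    rows = []
    for line in lines:
        if not line.strip():
            continue
        if line[0] in " \t" and rows:
            # Indented line: continuation of the current right-hand cell.
            rows[-1][1].append(line.strip())
            continue
        parts = re.split(r"\s{2,}|\t", line.strip(), maxsplit=1)
        rows.append([parts[0], parts[1:]])
    return [(label, " ".join(cell)) for label, cell in rows]

rows = parse_rows([
    "Twiki    Used for editing web pages using a browser.",
    "         Allows HTML tags to be embedded.",
    "No-Tags  A plain text markup system.",
])
```

Each pair maps naturally onto one table row with the label in the first cell and the joined text in the second.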
A paragraph is determined by 1 or more blank lines or a list item:
-----
* Review the use of embedded HTML tags as is done in Twiki
* Review the question of embedded language. Mark embedded code
with '\{% ... %\}' to allow for arbitrary code
-----
There are 2 paragraphs here and they are both bulleted items.
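The paragraph rule stated here (one or more blank lines, or the start of a list item, begins a new paragraph; other lines are re-wrapped) can be sketched in Python as follows. `paragraphs` and `BULLET` are illustrative names, and only "*" and "-" are treated as list markers in this sketch.

```python
import re

BULLET = re.compile(r"^\s*[*\-]\s+")

def paragraphs(text):
    """Split hard-wrapped text into paragraphs, one string per paragraph."""
    result, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the paragraph in progress, if any.
            if current:
                result.append(" ".join(current))
                current = []
            continue
        if BULLET.match(line) and current:
            # A list item starts a new paragraph without a blank line.
            result.append(" ".join(current))
            current = []
        current.append(line.strip())
    if current:
        result.append(" ".join(current))
    return result

sample = (
    "* Review the use of embedded HTML tags as is done in Twiki\n"
    "* Review the question of embedded language. Mark embedded code\n"
    "  with '{% ... %}' to allow for arbitrary code\n"
)
items = paragraphs(sample)
```

Run on the two bulleted lines above, this yields two paragraphs, with the wrapped continuation line joined onto the second item.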
I hope that this framework will allow us to establish a scalable set of rules that we
can still implement with reasonable effort to cover a wide range of eText applications.
That's all for the moment.