[REBOL] eText

Andrew Martin Fri, 17 Nov 2000 11:59:43 -0800

Earlier I wrote:
> ...eText to XML/XHTML/HTML and not bother to inflict markup on people.
> Instead, I'll use white space intelligently along with Rebol embedded in the
> script (in a very nice way), to generate web pages.

Here is the eText specification that Garold and I have been working on.
Naturally, the document itself is in the eText format.

We're discussing eText on this list: [EMAIL PROTECTED] Subscribe to
the list: [EMAIL PROTECTED]

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/
-><-



-- Attached file included as plaintext by Listar --
-- File: eText_am_02_glj.txt

eText
Note: The above line is the title of the HTML document and the text in the H1 tag for 
the HTML.
Author: Andrew Martin
eMail: [EMAIL PROTECTED]
Date: 16/November/2000
Site: http://members.nbci.com/AndrewMartin/
Comment: Did you know that writing the contents of a Rebol header is now quite natural 
(at least to me)? First word followed by a colon is the trigger for META data in the 
eText document. The data for the item continues on until the end of the line. The 
pattern could be continued...

GLJ -- I use the word / phrase ':' construct routinely from long before Rebol. I think 
we should retain it, but I am not certain yet just how best to do that.
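
The meta-data rule described in the Comment above could be sketched roughly as follows. This is a minimal Python illustration, not part of the specification: the names META_LINE and parse_header are hypothetical, and it assumes the first line is the title, that each meta item fits on one line, and that the meta block ends at the first line that does not start with a word and a colon.

```python
import re

# Hypothetical pattern: one word, a colon, then the value up to end of line.
META_LINE = re.compile(r"^([A-Za-z][A-Za-z0-9]*):\s*(.*)$")


def parse_header(lines):
    """Split a non-empty list of lines into (title, meta dict, body lines).

    Assumes the first line is the title and that meta lines follow it
    directly, ending at the first line that does not match META_LINE.
    """
    title = lines[0].strip()
    meta = {}
    index = 1
    while index < len(lines):
        match = META_LINE.match(lines[index])
        if not match:
            break
        meta[match.group(1)] = match.group(2).strip()
        index += 1
    return title, meta, lines[index:]
```

A word such as "Site" is matched before the colon inside its URL value, so the whole URL ends up as the value of that item.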

This is a header


Why is the above line a header (H2 in HTML)? Because it's separated from the text 
above and below by 2 blank lines. It's also short, less than 40 odd characters, so it 
must be header. It also _doesn't_ have a terminating period or full stop at the end or 
some other sentence terminator.

GLJ -- I see that we are working from slightly different perspectives. I was not 
working on guessing at the intended or incidental structure of totally free format 
text. Rather, I was looking at text that was intended to have a structure but was 
prepared in ASCII without access to further formatting. That is why I don't object to 
marking such things as indent levels. I also tend to think in outlines naturally, and 
to compose that way. 

This is a subsequent paragraph. This paragraph will have the first line indented on 
the HTML nice looking version. The above paragraph is an initial paragraph, a 
paragraph that shouldn't have the first line indented. Note that I'm letting my text 
editor wrap lines appropriately as I can't be bothered doing it for my tools, I'd much 
rather let my tools wrap my text for me. Wouldn't you?

I can tell that the above text are paragraphs, because they have a full stop at the 
end, are long, and have only one blank line between them. They also have multiple 
sentences in them, with a period (or other similar terminator, like "-", ":", ";", "?" 
or "!")  at the end of each sentence.

GLJ -- I work in a programmer's editor which word wraps automatically rather than 
something like Notepad or WordPad where the wrapping is only visual. I definitely *do* 
want the text re-wrapped as needed. HTML will wrap it anyway. This implies to me that 
blank lines separate paragraphs. This runs me into problems with list items which I 
tend not to separate.

Double Quotes

Surrounding text in double quotes should leave the text unchanged. This effect stops 
at a newline or end of line, just in case they're not balanced? We might need to think 
more about this.

GLJ -- No-Tags treats any item that would normally expect a balancing item as being a 
single character rather than markup if there is no balancing character within the 
paragraph. That is that effects are limited to paragraphs. This needs some work.

Short single lines of text between paragraphs, with one blank line before and after 
and no sentence terminator, should be a H3 heading or sub-heading. Perhaps at most 40 
characters in length.

GLJ -- How do you tell H2 from H3 from ...? I find a need for a title and at least 3 
levels of heading. Since HTML supports 6 levels I saw no reason not to do so as well.
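
The heading rules discussed so far (short line, no sentence terminator, two blank lines around an H2, one blank line around an H3) could be sketched like this in Python. It is only an illustration under assumptions: classify_block and its parameters are hypothetical names, the 40 character limit is taken from the text above, and the caller is assumed to have already counted the blank lines around each block.

```python
# Sentence terminators named in the text: a period or one of "-", ":", ";", "?", "!".
SENTENCE_ENDINGS = (".", "?", "!", ":", ";", "-")
MAX_HEADING_LENGTH = 40  # "less than 40 odd characters"


def classify_block(text, blank_lines_before, blank_lines_after):
    """Guess whether a block of text is an H2, an H3 or a paragraph."""
    line = text.strip()
    looks_like_heading = (
        "\n" not in line
        and 0 < len(line) <= MAX_HEADING_LENGTH
        and not line.endswith(SENTENCE_ENDINGS)
    )
    if not looks_like_heading:
        return "p"
    if blank_lines_before >= 2 and blank_lines_after >= 2:
        return "h2"
    return "h3"
```

This sketch only distinguishes two heading levels by blank lines; it does not answer the question of how to reach the remaining levels without explicit marks.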

If I insist on writing one sentence long paragraphs, going on and on, droning about 
nothing at all, until you are tired reading this text, it will come out all inside a 
H2 paragraph tag, and will be very obvious to all, and so should be very embarrassing 
at immediate glance, that the full stop or period is missing at the end of this 
sentence

If I try to trick the interpreter

The above line could be considered a sub-header. Or it might be a sentence fragment. 
It's short enough to be considered a sub-header, so even if the interpreter is wrong 
and makes it a sub-header using H3 HTML tag, it still makes the error obvious to human 
eyes.

----


The above should be one section of text


The above line should be a header, H2 in HTML. I totally agree with your statement: 
"The Basic Idea of eText is to allow documents to be created in relatively plain text 
that is still human readable." I'd also like to add that eText should be easy to 
create and modify for unsophisticated users, who only know how to type, and more 
computer professionals, who may be a bit tired and want something that's obvious.

eText also *shouldn't* require manual line wrapping. I'm using a Windows Notepad 
replacement to generate this text, the standard Windows Notepad should be able to cope 
as well. Just turn "Word Wrap" on.

GLJ -- I agree that eText should be suitable for unsophisticated users. I tend to 
think that I want a smooth transition to fairly complete control. I would like my 
plain text to allow me to go quite a ways before I need to move to better tools. I 
would even consider marking the document for level of formality from no markup to 
intermediate markup to full markup. I have no quarrel with the translator guessing at 
this. For example, if I use '#' to mark headers it should feel free to assume that all 
headers are marked, etc. I find that programs that get too smart can get very hard to 
outsmart. I at least want a way to override the program's guesses. That would allow me 
to feed plain text in and then go "fix up" the places where it didn't guess right. I 
will discuss the purposes of eText later.

I've improved on my ideas for recognising headers by closely examining your text in 
markup.mtx and noticed how your level 3 headings had a blank line before and after. 
Going backwards to two blank lines to separate level 2 headings is consistent with 
Project Gutenberg eText, and seems fairly obvious.

Character-wise markup

For character-wise markup, I'd like to use the following (basically as you suggest):

"*bold*"        - Asterisk text asterisk is bold or strong emphasis.
"_underline_"   - Underline text underline is underline or mild emphasis.
"~italic~"      - Tilde text tilde is italic.
"=fixed="       - Equals text equals is fixed width.
"--"    - Two dash inside text is en-dash.
"---"   - Three dash inside text is em-dash.
"----"  - Four dash or more in the left column is a horizontal rule.
"===="  - Four equals or more in the left colum is a bigger break, perhaps using the 
DIV tag in HTML. I'm not sure what it should be yet.
"^text" - superscripts the following text until the first white space.
"^^text"        - subscripts the following text?

The above text should end up as a HTML table of definitions. The pattern should be:

                Text [TAB | SPACE] "-" [TAB | SPACE] text

The above line (effectively code or script) should be blockquoted, preformatted and in 
typewriter font. The pattern is two tabs in from the left margin. The text should be 
unchanged, except for HTML tags, which should be translated so as to be visible in a 
browser. In other words, "<" should be "&lt;", ">" should be "&gt;" and "&" should be 
"&amp;".

GLJ -- While I went back and changed this to a table, I wouldn't have considered doing 
it automatically. I often do such things as this with responses -- or other material 
-- set off by dashes. If all text were prepared with paragraphs recognized by newline, 
then perhaps the fact that there were multiple lines could signal a table?
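
The three ideas in this section (character-wise markup, the "Text - text" definition pattern, and escaped preformatted lines) could be sketched in Python as below. This is a hedged illustration, not an agreed implementation: all function names are hypothetical, the choice of the strong, u, i and tt tags is an assumption, superscripts and subscripts are left out, and a table is only guessed when at least two consecutive lines fit the pattern, following the suggestion that multiple lines could signal a table.

```python
import html
import re


def _paired(marker, tag):
    """Build (pattern, replacement) for text wrapped in a one-character marker."""
    m = re.escape(marker)
    pattern = re.compile(
        rf"(?<![\w{m}]){m}([^\s{m}](?:[^{m}\n]*[^\s{m}])?){m}(?![\w{m}])"
    )
    return pattern, rf"<{tag}>\1</{tag}>"


INLINE_RULES = [
    _paired("*", "strong"),
    _paired("_", "u"),
    _paired("~", "i"),
    _paired("=", "tt"),
    (re.compile(r"(?<!-)---(?!-)"), "&mdash;"),  # exactly three dashes
    (re.compile(r"(?<!-)--(?!-)"), "&ndash;"),   # exactly two dashes
]

DEFINITION = re.compile(r"^(\S.*?)[ \t]+-[ \t]+(.*)$")


def inline_markup(text):
    """Escape HTML first, then apply the character-wise markup rules."""
    result = html.escape(text, quote=False)
    for pattern, replacement in INLINE_RULES:
        result = pattern.sub(replacement, result)
    return result


def preformatted(line):
    """Two leading tabs mark code: escape it and leave it otherwise unchanged."""
    if not line.startswith("\t\t"):
        return None
    return "<blockquote><pre>" + html.escape(line[2:], quote=False) + "</pre></blockquote>"


def definitions_to_table(lines):
    """Turn consecutive 'term - meaning' lines into table rows, or return None."""
    rows = []
    for line in lines:
        match = DEFINITION.match(line)
        if match is None:
            return None  # one line does not fit, so do not guess a table
        term, meaning = (html.escape(part.strip(), quote=False) for part in match.groups())
        rows.append(f"<tr><td>{term}</td><td>{meaning}</td></tr>")
    if len(rows) < 2:
        return None
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

Markers inside a word, as in snake_case_name, are left alone because each marker must not touch a word character on its outer side. That is one possible way to limit surprises, not the only one.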

Embedding HTML


Do we really need to? A graphically intensive page would be better designed using GIF, 
JPG, Flash and Style Sheets, and should have a computer graphics artist working on the 
project. There would be very little human-readable text in the page, I feel. Still, if 
we're careful, it could be included?

GLJ -- The only reasons for considering embedded HTML are: 1) Twiki allowed it and was 
one of the systems I was considering, 2) It is currently the only approach I have to 
font size and color, and 3) The notation doesn't rule it out. I don't care for 
embedded HTML, but more and more people are allowing it and using it. Also formatting 
within table cells is easier with some of the tags. I object to HTML as being the only 
way to do things which some systems insist on.

Embedding Rebol script


Rebol Server Pages (RSP)

I think the ASP "standard", that I adapted for Rebol, should serve:

"<%!"   - Directives or meta-information.
"<%"    - Rebol code that is not intended to return a value.
"<%:"   - Rebol code that returns a value.
"%>"    - Terminate any of the above opening tags.

The ASP "standard" uses "@" for directives and "=" for expressions that return a value.

GLJ -- I don't have a quarrel with any procedure that works here. I think that 
something that looks familiar to anyone who has embedded other languages in HTML is 
acceptable. I don't think there is any reason for introducing anything really foreign 
looking. The suggested use of braces came from MTX and Latte which found that they 
provided minimal interference.
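
To make the tag scheme above concrete, here is a small Python sketch that only splits a page into plain text and the three kinds of embedded segment. It does not run any Rebol. The names RSP_TAG, KINDS and split_rsp are hypothetical, and the sketch assumes that "%>" never appears inside the embedded code itself.

```python
import re

# "<%" optionally followed by "!" or ":", then the shortest run up to "%>".
RSP_TAG = re.compile(r"<%([!:]?)(.*?)%>", re.DOTALL)

KINDS = {"!": "directive", "": "statement", ":": "expression"}


def split_rsp(source):
    """Return a list of (kind, text) segments in document order."""
    segments = []
    position = 0
    for match in RSP_TAG.finditer(source):
        if match.start() > position:
            segments.append(("text", source[position:match.start()]))
        segments.append((KINDS[match.group(1)], match.group(2).strip()))
        position = match.end()
    if position < len(source):
        segments.append(("text", source[position:]))
    return segments
```

A later pass could then evaluate the "expression" segments and substitute their values, while "statement" segments would be run only for their side effects.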

Embedded Rebol

For simple embedding of Rebol values in the text, I'd suggest using ":" or colon to 
mark the start of rebol word to "get". For example, the date now is :now. The ":now" 
will be replaced with the rebol value for now, 17/Oct/2010 or whatever the time/date 
is. If the value is a file!, then the value gets substituted appropriately into the 
resulting HTML file.

GLJ -- Ok -- provisionally. I think that trying to guess too cleverly will result in 
surprises. I prefer that all programs adhere to the WYGINS (What You Get Is No 
Surprise) principle.
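
The ":word" substitution could look something like the following Python sketch, which stands in for the Rebol "get" with a plain dictionary lookup. The names GET_WORD and substitute_words are hypothetical, the word syntax is narrower than real Rebol words, and unknown words are deliberately left visible rather than guessed at, in the spirit of WYGINS.

```python
import re

# A colon that does not follow a word character, then a simple word.
# This keeps "http://..." and "file:name.txt" from being treated as get-words.
GET_WORD = re.compile(r"(?<!\w):([A-Za-z][\w-]*)")


def substitute_words(text, values):
    """Replace each :word with str(values[word]) when the word is known."""
    def replace(match):
        name = match.group(1)
        if name not in values:
            return match.group(0)  # leave unknown words as written
        return str(values[name])

    return GET_WORD.sub(replace, text)
```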


Links


Here's a link to my site: http://members.nbci.com/AndrewMartin/. My email address: 
[EMAIL PROTECTED] Here's some boiler plate text to click on: file:MyBoilerPlate.txt. 
Note that Rebol and Oscar can recognise url! and email! datatypes. I think a sentence, 
an initial capital letter terminated with a ":", followed by a url! or email! (or 
"text.txt" for file!) and followed by a period, should be sufficient grounds to be 
considered a link.

So the above text should look like this in HTML:
                <a href="http://members.nbci.com/AndrewMartin/">Here's a link to my 
site</a>.
        and:
                <a href="mailto:[EMAIL PROTECTED]">My email address</a>.
        and:
                <a href="MyBoilerPlate.txt">Here's some boiler plate text to click 
on</a>.

Note that the ":" has been removed, and the "." is after the link. I've found that 
having the period outside the link is more pleasing to my eye at least.

GLJ -- The URL and the email address are fine. Most browsers and modern mail clients 
will do this. I am wary of the file notation as there isn't any standard practice that 
I can tie it to.

GLJ -- "Here's some boiler plate text to click on: file:MyBoilerPlate.txt. " I think 
that the "file:" sneaked in there. "Here's some boiler plate text to click on: 
MyBoilerPlate.txt." matches the pattern you describe [ Initial cap words: <something 
linkable>. ] works for me. It is not so strange that I couldn't adjust.

A picture could be recognised by the above pattern, then checked for .gif or .jpg 
extensions and substituting an "<img>" tag instead of the "<a>" tag. The text before 
the colon substitutes for the ALT text.

GLJ -- Once we recognize that we have a link, trying to interpret the extension is 
perfectly reasonable.
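
The link pattern in this section (initial capital, text, a colon, something linkable, a period) could be sketched in Python as follows. The names LINK_SENTENCE and link_sentences are hypothetical, and the regular expressions for URLs, email addresses and file names are deliberately rough stand-ins for Rebol's url!, email! and file! recognition.

```python
import html
import re

LINK_SENTENCE = re.compile(
    r"(?P<label>[A-Z][^.:!?]*): "
    r"(?P<target>(?:https?|ftp)://\S+?"          # URL
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"             # email address
    r"|[\w-]+\.\w+)"                             # simple file name
    r"\.(?=\s|$)"                                # the sentence's closing period
)

IMAGE_EXTENSIONS = (".gif", ".jpg")


def _render(match):
    label = match.group("label")
    target = match.group("target")
    is_email = "@" in target and "://" not in target
    href = html.escape("mailto:" + target if is_email else target)
    if target.lower().endswith(IMAGE_EXTENSIONS):
        # The text before the colon becomes the ALT text.
        return f'<img src="{href}" alt="{html.escape(label)}">.'
    return f'<a href="{href}">{html.escape(label, quote=False)}</a>.'


def link_sentences(text):
    """Rewrite every 'Label: target.' sentence as a link or an image."""
    return LINK_SENTENCE.sub(_render, text)
```

The colon is dropped and the period stays outside the tag, as described above.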

When creating the page interactively, like in a Wiki or Sparrow, when a link to a 
local file is created, the system will create a "blank" page for the link to go to (if 
there's not already a page of that name). The system should also be free to modify the 
case of file name in the link to agree with pre-existing local files. It should also 
substitute "%20" for spaces in filenames, much like MS Internet Explorer does, and 
Rebol does with URLs.

GLJ -- I think I like the Wiki idea of a '?' that is a link to create a non-existent 
local link. It is easy to get used to and immediately points out spelling mistakes 
when presented.

Wiki Word Links

I dislike the WikiWordLinks. For example a link to Wiki requires the word to be 
written as "WIki", which is ~unnatural~ to me.

GLJ -- It depends on what you are doing. If you are working in a Wiki where most of 
the point is linking to other material, I think WikiWords make lots of sense. The fact 
that there is a surprise in other text is unfortunate, but the alternative is a 
requirement to format every embedded link specially which violates our premise of 
natural ease of use. Wiki suffers from lack of any easy way to mark non-WikiWord 
links. The WikiWord tends to be natural to those who came from programming 
environments such as Pascal. Also, Wiki is not intended to translate text that wasn't 
created for it.


Including files


:"My Boiler Plate Text File.txt"

The above line is a command to include the contents of the %"My Boiler Plate Text 
File.txt" straight into this text when converted to HTML. The colon can be read as 
"get (or evaluate) the thing to the right". The complete contents of the file are 
substituted for the line (including the newline at the end).

GLJ -- This is starting to get dicey. This construct doesn't suggest to me any special 
treatment. It is available for use from our syntax rules, but I don't care for it. I 
suggest that at the level of file inclusion and variables we consider that we have 
entered the realm of embedded Rebol. I would prefer to see this as <%: %"My Boiler 
Plate Text File.txt" %> This is embedded Rebol returning a value which is the content 
of a file. I don't even object to using specific directives for this.

I think that "%*.txt" files should be executed in this page as if their contents had 
been written in manually. I'm not sure what should happen with other file extensions. 
Perhaps "%*.html" and "%*.htm" files should be inserted *after* the page gets 
translated into HTML?

GLJ -- I agree that "%*.txt" files should get included at the time of construction. If 
we allow embedded HTML, there is no problem with including them at the time we expand 
the file. Rebol Server Pages presumably have a ".rsp" extension and will get included 
when seen also.
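
The colon form of file inclusion could be sketched in Python like this. It is only an illustration: INCLUDE_LINE and expand_includes are hypothetical names, read_file is a caller-supplied function (so no particular file layout is assumed), only ".txt" targets are expanded at this stage, and the depth limit is an added safety assumption against files that include each other.

```python
import re

# A line consisting only of a colon followed by a quoted file name.
INCLUDE_LINE = re.compile(r'^:"([^"]+)"\s*$')


def expand_includes(text, read_file, depth=0, max_depth=8):
    """Replace each include line with the recursively expanded file contents."""
    if depth > max_depth:
        raise RecursionError("includes nested too deeply; possible cycle")
    output = []
    for line in text.splitlines(keepends=True):
        match = INCLUDE_LINE.match(line)
        if match and match.group(1).lower().endswith(".txt"):
            included = read_file(match.group(1))
            output.append(expand_includes(included, read_file, depth + 1, max_depth))
        else:
            output.append(line)  # other extensions are left for a later pass
    return "".join(output)
```

The whole line, including its newline, is replaced by the file contents, so included text behaves as if it had been typed in place.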

Lists


I like just simply using an asterisk in the left margin for unordered lists:

*       My first list item.
*       Another item.

I think that's reasonably sensible. An ordered list just substitutes "I" for Roman 
(followed by a tab), "0" or zero for Arabic, "A" or "a" for capital or lower case 
letters. For nested lists, we could try simply using one or more tabs (or several 
spaces) before the list item. Like:

*       Unordered list item.
        0       First item
        0       Second item
        0       Third
*       Option
        I       Roman list item
                A       Arabic list item
                A       Another
        I       More romans

Naturally, the system numbers list items sequentially, and doesn't care when we 
shuffle the order of the above, and will always number the items correctly. An 
optional few characters after the list item "key", like ")./" should be easy to add to 
the parse rules for list items.

An improvement on the above would be to allow 0 - 9 for Arabic, correctly written 
roman numerals for Roman, A - Z and a - z for Letters.

GLJ -- It seems to me that bulleted lists are unordered and ordered lists have values. 
I wouldn't format an ordered list without item "numbers". Output should create the 
numbers in sequence, as they will automatically if we use lists in HTML. Most of the 
systems require whitespace to indent a list item and measure whitespace to determine 
sub lists, but there is no real reason that the top-level list needs to be indented. I 
think determining sublists from indentation makes sense. I suggest something like: 1) 
The first list element determines the type of the (sub)list -- 'A' - Upper Alpha, 'a' 
- Lower Alpha, 'I' - Upper Roman, 'i' - Lower Roman, '0-9' - Numbered, '*' | '-' | 
'o' (others?) - bulleted (unordered) list. I don't know about the support for 
numbering styles in all browsers. The intent is that if it looks like lists it should 
create lists.  I think that implies allowing the "number" to be followed by (at least) 
')' or '.'. I suggest that any of the styles that Word 
or a similar program will number automatically should be allowed -- well, maybe not. 
I just looked and Word tries to manage things that I wouldn't. Under the WYGINS 
principle, I don't think we should try to support anything we can't render into 
reasonable HTML. I know that lists can be done as tables because I have seen it done. 
I consider that extreme. We are dealing with simple ideas and the result should be 
relatively simple HTML when rendered.
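
As one possible reading of the list rules above, the following Python sketch takes list lines and reports, for each item, its nesting depth and the kind of list it belongs to. The first item of each (sub)list decides the kind, and indentation decides the nesting. All names are hypothetical, the kind labels are made up for the illustration, and the sketch assumes the key is followed by a tab or at least two spaces so that ordinary sentences starting with "I" or "A" are not mistaken for items.

```python
import re

ITEM = re.compile(
    r"^(?P<indent>[ \t]*)(?P<key>[*o-]|[0-9]+|[AaIi])[).]?(?:\t| {2,})(?P<text>\S.*)$"
)

LETTER_KINDS = {
    "A": "upper-alpha", "a": "lower-alpha",
    "I": "upper-roman", "i": "lower-roman",
}


def list_kind(key):
    """Map the key of the first item to a list kind."""
    if key in ("*", "-", "o"):
        return "bulleted"
    if key.isdigit():
        return "decimal"
    return LETTER_KINDS.get(key)


def parse_list(lines):
    """Return (depth, kind, text) per item; depth 0 is the top-level list."""
    items = []
    indents = []  # indentation widths of the currently open lists
    kinds = []    # kind chosen by the first item at each depth
    for line in lines:
        match = ITEM.match(line)
        if match is None:
            continue  # non-item lines are ignored in this sketch
        width = len(match.group("indent").expandtabs(8))
        while indents and width < indents[-1]:
            indents.pop()
            kinds.pop()
        if not indents or width > indents[-1]:
            indents.append(width)
            kinds.append(list_kind(match.group("key")))
        items.append((len(indents) - 1, kinds[-1], match.group("text")))
    return items
```

Because only the first key of a sublist is consulted, items can be reordered freely and the renderer (for example an HTML ordered list) does the actual numbering.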


Tables


"By default, tables will have borders, blanks for empty cells, and resize to cover 95% 
of the window side to side." I disagree with "borders". I think white space is better, 
along with careful alignment of contents. Otherwise I'm OK with this.

GLJ -- This is the MTX convention. I have no quarrel with any defaults if they can be 
overridden. Possibly white space for tables that we guess at and borders if the tables 
are created using vertical bars? I still don't know how to manage multi line text 
within a cell. It is unusual but not really rare. When such a table is rendered in 
text it is usually shown with horizontal cell boundaries (|----|----|) as well as 
vertical ones. Since any such complex tables will require specialized formatting, I 
suggest we leave that for later and probably then deal with it as embedded HTML or 
something similar.

Processing the eText file


1       Parse the text, processing any include files to form one big file. Nest as 
appropriate.
2       Process the one big file, executing the Rebol script embedded in the file and 
substituting into the text any results as indicated. The page header object generated 
from the meta information at the top should be generated first in this step.

GLJ -- We may decide that the guessing pass needs to be separate? I don't know how 
much context that will require but it may well be more than is convenient to handle 
along with a formatting pass.

3       Generate the HTML code as appropriate.

GLJ -- No problem with that. Generally the files will be relatively small so there 
isn't any problem there. Even if we have to write an intermediate file we certainly 
can. There are optimizations that we can perform if we need to later.

GLJ:
Let's revisit the uses and purposes of eText and see if we can get anything from it. I 
see a gradient scale of input:

1) Accept really plain text that was never intended to be formatted as other than 
plain text and turn it into something that has reasonable formatting. This requires 
guessing at the intent from the structure because there were never any real standards 
or conventions used in it. The Project Gutenberg texts are the primary example.

2) Format the sort of text that is used in emails and ASCII intended to convey a 
little information. This requires at least *bold* support, lists, some way of dealing 
with block quoted text such as seen in email, and Auto linking for URLs and email 
addresses. Preformatted blocks for code is also needed.

3) Provide for simple text entry that is intended to generate web pages as in Wiki. 
Since this is intended to be done "on the fly" it doesn't add much, perhaps italics, 
specific heading levels, and simple tables.

4) Text documents that are intended to be formatted. These can be generated directly 
or possibly as program output intended for later formatting. This user may be 
unsophisticated in terms of formatting languages but is likely quite sophisticated or 
at least deliberate about the formatting that is intended. This would support most of 
the intentional forms of formatting and some directive type markup. Such features as 
the MTX table formatting come to mind. Section and chapter breaks, specified heading 
levels, block quotes, pull quotes, etc. are all candidates.

5) Text that is intended to be templates for generating complex web sets. At this 
point we have a specification language whose intent is to generate web pages. The 
assumption of unsophisticated users and casual text entry give way to the idea of 
control that scales up well. Embedded Rebol is used and needed. Embedded HTML or 
advanced directives make sense. Certainly some form of variables and limited 
programming make sense.

Are there any steps in the gradient that I have missed?

Given this, you started at item 1 and I started around item 2 or 3, and that creates 
some difference in viewpoint.

I can see some flags to the program to control the level of guessing versus rules 
applied. Even with type 1 items, I guess I see the objective as guessing as well as 
possible and then generating a more advanced form of the text or adding enough 
corrective markup to the text to correct places where the program guessed incorrectly 
rather than always rendering the same plain text repeatedly.

Using this gradient can we define a scalable set of rules that supports all the levels 
or do we have to have more than 1 system to cover the range?

One area of difficulty is the determination of paragraphs. While WordPad and such 
programs wrap text only visually, many editors, including programming editors and email 
editors physically wrap the text using newlines. This gives rise to the convention of 
delimiting paragraphs with blank lines. Except in tables and lists where lines are 
considered separate. Since the wrapped text is the more difficult and handles the 
unwrapped text also, I think we need to deal with it.

I find that I want to use multiple lines in list elements and in some types of tables. 
I find that I assume that table and list items (which are individually marked) assume 
that each item or row constitutes a new paragraph. I will start a list (usually 
indented) after a paragraph without inserting a separating blank line.

When I generate a list that looks more like a table I usually intend to have a line on 
the left be in the same cell as the multiple lines on the right. e.g.:

-----
Twiki     Used for editing web pages using a browser.
          Supports automatic links of "wiki words" with multiple caps
               in the word. Add '?' if not in index, create from '?'
               link.
          Uses %...% for variables which are defined externally and
               are multi-level.
          Allows HTML tags to be embedded.
          Generates full navigation bar, editing, file attachments.
          Category mechanism is very interesting. Compare to jrju.

No-Tags   A plain text markup system.
-----

I would like our system to do something reasonable with this or to have a way to make 
it do something reasonable without undue hardship.

A paragraph is determined by 1 or more blank lines or a list item:
-----
     * Review the use of embedded HTML tags as is done in Twiki
     * Review the question of embedded language. Mark embedded code
          with '\{% ... %\}' to allow for arbitrary code

-----
There are 2 paragraphs here and they are both bulleted items.
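
The paragraph rule just stated (one or more blank lines, or a new list item, starts a new paragraph, and hard-wrapped lines are joined) could be sketched in Python as below. The names BULLET and split_paragraphs are hypothetical, and only "*" and "-" bullets followed by white space are recognised here.

```python
import re

# A bullet is "*" or "-" followed by white space, possibly indented.
BULLET = re.compile(r"^\s*[*-]\s+")


def split_paragraphs(text):
    """Join hard-wrapped lines into paragraphs; bullets start new paragraphs."""
    paragraphs = []
    current = []

    def flush():
        if current:
            paragraphs.append(" ".join(current))
            current.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            flush()            # blank line ends the paragraph
        elif BULLET.match(raw):
            flush()            # a list item always starts a new paragraph
            current.append(line)
        else:
            current.append(line)  # continuation of a wrapped line
    flush()
    return paragraphs
```

A list that follows a paragraph without a separating blank line is still split off correctly, since the bullet itself ends the preceding paragraph.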

I hope that this framework will allow us to establish a scalable set of rules that we 
can still implement with reasonable effort to cover a wide range of eText applications.

That's all for the moment.
