Hello [EMAIL PROTECTED]!

On 20-Jan-00, you wrote:

 r> I still can't do what I need to do!

Tell me if this is useful:

>> html-source: {
{    <TABLE>
{    <TR><TD>ALPHA</TD><TD>ONE</TD></TR>
{    <TR><TD>BETA</TD><TD>TWO</TD></TR>
{    <TR><TD COLSPAN=2>DUMMY LINE ONE</TD></TR>
{    <TR><TD>GAMMA</TD><TD>THREE</TD></TR>
{    <TR><TD>DELTA</TD><TD>FOUR</TD></TR>
{    <TR><TD COLSPAN=2>DUMMY LINE TWO</TD></TR>
{    <TR><TD>EPSILON</TD><TD>FIVE</TD></TR>
{    </TABLE>
{    }
== {
<TABLE>
<TR><TD>ALPHA</TD><TD>ONE</TD></TR>
<TR><TD>BETA</TD><TD>TWO</TD></TR>
<TR><TD COLSPAN=2>DUMMY LINE ONE</TD></TR>
<TR>...
>> parse-html html-source
== ["ALPHA" "ONE" "BETA" "TWO" "GAMMA" "THREE" "DELTA" "FOUR" "EPSILON" "FIVE"]
>> foreach [name value] parse-html html-source [
[    print [name "=" value]
[    ]
ALPHA = ONE
BETA = TWO
GAMMA = THREE
DELTA = FOUR
EPSILON = FIVE

This function has the advantage of being able to parse malformed HTML
too:

>> malformed-html: {
{    I hope you don't have to cope with things like this!
{    <HTML>
{    <TR><TD>You don't want</TD><TD>this, do you?</TD></TR>
{    
{    Some unwanted content...
{    
{    <TABLE>
{    Bla bla bla        
{    <TR> Hey, look: this is very bad HTML!
{    
{    <TD>
{    
{    ALPHA</TD><TD>
{    
{    ONE</TD></TR><TR>
{    ...<TD>a</TD>b<TD>c</TD>d<TD>e</TD>
{    </TR>
{    <TR><TD>BETA</TD><TD>TWO</TD></TR>
{    and so on...
{    </TABLE>
{    </BODY>                            
{    </HTML>
{    }
== {
I hope you don't have to cope with things like this!
<HTML>
<TR><TD>You don't want</TD><TD>this, do you?</TD></TR>

Some unwan...
>> parse-html malformed-html
== ["^/^/ALPHA" "^/^/ONE" "BETA" "TWO"]

It will also accept other tags inside the cells, stripping them:

>> parse-html {<TABLE><TR><TD>Some tags <B>here</B></TD><TD>etc.</TD></TR></TABLE>}
== ["Some tags here" "etc."]

And now, here's the code. It is written as a state machine; there may
be simpler ways to do this, but this approach is very flexible.

REBOL []

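; Top-level rule: the input is a sequence of tags and text runs.
html-rule: [some [tag | text]]
; A tag: call the current handler for the tags we care about,
; then skip everything up to the closing ">".
tag: [ "<" [
    "TABLE" (start-table) |
    "/TABLE" (end-table) |
    "TD" (start-cell) |
    "/TD" (end-cell) |
    "TR" (start-row) |
    "/TR" (end-row) |
    none ]              ; any other tag is simply skipped
    thru ">"
]
; A text run: hand it to whatever PROCESS currently refers to.
text: [
    copy content some characters
    (process content)
]
; Text is any run of characters other than "<" and ">".
characters: complement charset "<>"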

result: make block! 10
buffer: make block! 10

discard: func [
    "Discards unwanted content"
    content [string!]
] []

store: func [
    "Store content"
    content [string!]
] [
    append last buffer content
]

process: :discard

in-row: reduce [
    func [
        "Cell start"
    ] [
        append buffer make string! 100
        process: :store
    ]
    func [
        "Cell end"
    ] [
        process: :discard
    ]
]
not-in-row: reduce [none none]

in-table: reduce [
    none
    none
    func [
        "Row start"
    ] [
        set [start-cell end-cell] in-row
        clear buffer
        process: :discard
    ]
    func [
        "Row end"
    ] [
        if 2 = length? buffer [
            append result buffer
        ]
        set [start-cell end-cell] not-in-row
        process: :discard
    ]
]
not-in-table: reduce [none none none none]

set [start-cell end-cell start-row end-row] not-in-table

start-table: func [
    "Table start"
] [
    set [start-cell end-cell start-row end-row] in-table
]

end-table: func [
    "Table end"
] [
    set [start-cell end-cell start-row end-row] not-in-table
]

parse-html: func [
    "Parse the HTML source"
    html [string!]
] [
    clear result
    parse/all html html-rule
    result
]
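
The state switching is done by setting a group of handler words from a
block of functions. In case that trick is not obvious, here is a minimal
sketch of the idea on its own. The names active, idle, on-a and on-b are
made up for illustration and are not used by the script above.

```rebol
REBOL []

; Hypothetical handler table for an "active" state: one function per event.
active: reduce [
    func ["Event A"] [print "A handled"]
    func ["Event B"] [print "B handled"]
]

; In the "idle" state both events are ignored, so the handlers are none.
idle: reduce [none none]

; Start in the idle state.
set [on-a on-b] idle
on-a    ; evaluates to none, nothing happens

; Switch state by setting both handler words at once.
set [on-a on-b] active
on-a    ; prints "A handled"
on-b    ; prints "B handled"
```

The parser does the same thing with start-cell, end-cell, start-row and
end-row, so a tag that is not valid in the current state just evaluates
to none and is ignored.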


Regards,
    Gabriele.
-- 
o--------------------) .-^-. (----------------------------------o
| Gabriele Santilli / /_/_\_\ \ Amiga Group Italia --- L'Aquila |
| GIESSE on IRC     \ \-\_/-/ /  http://www.amyresource.it/AGI/ |
o--------------------) `-v-' (----------------------------------o
