RFC[01] - nZyme C++ code segment

George Makrydakis Mon, 13 Mar 2006 14:58:24 -0800

Greetings to everybody in the *LFS projects.

PART 1: Sum of the previous bash work - for new readers of these posts.
PART 2: Current Status and Code Segments in C++ - for the seasoned reader.



PART 1: Historical Review
----------------------------

As most of you know, I have been working on a fully self - hosted alfs solution based on C++ after choosing it as the development language. The following postis one concerning one of the most essential parts of the project, namely a fast, small and easy to debug / develop XML "parsing" class. Initially, the work wasdone for providing a fully self hosted bash based solution to integrate into the jhalfs project; after talking to the jhalfs team I decided to move to a regularprogramming language that would prove more adequate than a script - based solution. Thus this is not a portion for a bash based alfs solution but a C++ one.


For reviewing purposes, please consider reading the following messages:

x2sh revised:
Thu Feb 23 20:17:30 MST 2006

http://linuxfromscratch.org/pipermail/alfs-discuss/2006-February/007641.html

The above link describes a pure bash based method for extracting the significant data out of the entire book in about 1 (one) min using nothing but bash builtin commands for string manipulation (no third party binaries). This is a branch currently developed for personal use in other settings.


x2sh old "library - like" bash source
Mon Feb 20 22:14:33 MST 2006

http://linuxfromscratch.org/pipermail/alfs-discuss/2006-February/007633.html

The above link contains a character by character implementation of an XML parser (FSM - finite state machine like) as posted in the mailing list, _different_from the one above. Very slow but functional (the revised version "parses" the book 60x times faster though...), consider it an abandoned prototype, postedmainly for archiving purposes. Check the mailing list and the irc logs of the alfs-dev channel for discussions with official developers of the jhalfs / nalfsproject.


PART 2: Current Status and Code Segments in C++
-----------------------------------------------

When I started building this (yet another) XML "parser", the main goal was to have as few lines of code as possible, making it easier for me and others tomaintain / modify it. This parser tries to avoid as much as possible relying on a classical FSM prototype (Finite State Machine) that uses character bycharacter recognition of the semantically important strings. The FSM is to be used in a later setting for another use within the parsing module, but not as themain engine for extracting significant XML syntax patterns or error controlling them. This is more of an approach and not a dogmatic view of things. Whatever isto be built for this project is to be done using the standard C++ STL classes and the like for the sake of having it fully self hosted and as small as possible,and as portable as possible.

Let us analyze the structure of a typical XML document, within which we find the following patterns when declaring semantically significant data of unequivocalpresence whenever and wherever met:


        01.)    <!ATTLIST ...>
        02.)    <!ELEMENT ...>
        03.)    <!ENTITY ...>
        04.)    <!NOTATION ...>
        05.)    <?name ... ?>
        06.)    <name ... />
        07.)    <!-- character permissible but the '--' string to occur -->
        08.)    <![CDATA[text]]>
        09.)    <![IGNORE[declarations]]>
        10.)    <![INCLUDE[declarations]]>
        11.)    <!DOCTYPE ...[declarations]>
        12.)    <name ... >
        13.)    </name>
        14.)    pure text with no special characters whatsoever

The following observations must be made:

OBV-01.)

Alphanumerical strings of non XML reserved characters that are delimited by a single (<,>) sequence bear significant syntactic meaning, while strings delimitedby a single (>,<) sequence do not. Single (<,<), single (>,>) when encountered outside string tokens extracted from a DTD stream, are considered to be part ofan invalid / spurious XML file.


OBV-02.)

Whitespace character sequences ( \t\r\v\n) have different separator meaning when within a pair(GL) or within a pair(LG) sequence. The first whitespace sequencewithin a pair (GL) identifies an element, a preprocessing command, a declaration and so on starting immediately after the occurrence of a '<' character.


OBV-03.)

The <!DOCTYPE ... [declarations]...> where the last ... can be a whitespace character sequence (depends on the editor of the file, it is not a violation to doso) is the only syntactically significant structure where pair(GG) and pair (LL) can be found without raising a fatal exception. Note the following excerpt fromthe LFS XML source:

...

<!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" 
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"; [
<!ENTITY % general-entities SYSTEM "../general.ent">
%general-entities;
]>

...

This is completely equivalent to the following stream:
...

<!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" 
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"; [<!ENTITY % general-entities SYSTEM 
"../general.ent"> %general-entities;]>

...

Note now:

<!DOCTYPE ...<!ENTITY...> ... > %general-entities;]>

Same applies for the less commonly used IGNORE and INCLUDE statements, which in any case fall within DOCTYPE and dealing with them is to be considered asdealing with the DOCTYPE declaration (DTD is not the !DOCTYPE but rather the dtd file...). For the purposes of this document we will be limiting ourselves toDOCTYPE, which is the hierarchically most important global container.


OBV-04.)

Of the 14 patterns that can be met in an XML source stream of characters, member 11 (DOCTYPE) and the combination of 12+13 are to be considered hierarchicalpatterns, with everything else being considered as a stand - alone type.



OBV-05.)

With the notable exception of the consequences of OBV-03, we can deduct that for a given valid XML stream there is a "binary" separation that takes placebetween character streams delimited by pair(LG) who require further parsing and those delimited by pair(GL) who do not. For a moment, we put aside the DOCTYPEissue. Such pairs are found mixed with various whitespace character sequences ( \t\r\v\n) in various combinations, but we always follow the pattern below:


[FRAGMENT01]: ...<...>...<...>...<...>...

For a given line of XML source, let's limit ourselves to possible scenarios with the 
first and the last delimiter characters (<),(>):

[PREFIX...](first of either < or >),...,(last of either < or >)[SUFFIX...]
--------------------------------------------------------------------------

case 1: [PREFIX...](<),...,(>)[SUFFIX...]
case 2: [PREFIX...](>),...,(>)[SUFFIX...]
case 3: [PREFIX...](<),...,(<)[SUFFIX...]
case 4: [PREFIX...](>),...,(<)[SUFFIX...]
case 5: [PREFIX...](NONE),...,(NONE)[SUFFIX...]

Provided FRAGMENT01 succession is met (no pair(GG) or pair (LL) are found), it 
is easy to convert the above to successive strings like the ones below:

case 1:  (<...>) -----> no prefix, no suffix, fully contains ONE of the 
aforementioned patterns fully self - contained, parsing required.
case 2: !(<...>) -----> plain character data that do not need to be further 
parsed, parameter substitution if necessary, but not parsing.

Dividing in this way the entire stream, you can opt whether to parse or not depending on whether there is delimitation or not, because you are working with bigchunks of strings. Depending on how this is implemented you can also have a very slim session of code doing this, without the need of continuously monitoringstate by observing the occurrence of specific reserved characters in significantly important sequences. This (5->2) conversion can be seen here, but pleasecheck the entire code segment as posted towards the end of this post (NOTE: error control has been deliberately omited but it is present and functional in theauthor's repository):


*SAMPLE C++ CODE EXCERPT*

...

while (!xmlLINE.empty())
{
        if (!xmlBUFF.empty()) { xmlLINE = xmlBUFF + " " + xmlLINE; }
        xmlBUFF.clear();
        cTAG = xmlLINE.find(">");
        oTAG = xmlLINE.find("<");
        if (( oTAG == string::npos ) || ( cTAG == string::npos ))
        {
                if (( oTAG == string::npos ) && ( cTAG == string::npos ))
                {
                        xmlITEM.push_back(xmlLINE);
                        xmlLINE.clear();
                        break;
                }
                else if (( oTAG != string::npos ) && ( cTAG == string::npos ))
                {
                        xmlBUFF = xmlLINE.substr(oTAG);
                        xmlITEM.push_back(xmlLINE.substr(0, oTAG));
                        xmlLINE.clear();
                        break;                                          
                }
        }
        else
        {       
                xmlITEM.push_back(xmlLINE.substr(0, oTAG));
                xmlITEM.push_back(xmlLINE.substr(oTAG, cTAG + 1 - oTAG));
                xmlLINE = xmlLINE.substr(cTAG + 1);
        }
}

...

Notice that reserved character presence is not monitored by its presence, but rather by its absence, avoiding the need to monitor with even more elaborate if -like control structures or switch - case constructs.


Let's get back at the DOCTYPE issue:

<!DOCTYPE ...<!ENTITY...> ... > %general-entities;]>

If we directly applied the above method here we would likely fail because of pair(GG) and pair (LL) presence. The problem can be solved if we wrap the entireDOCTYPE declaration hierarchy within a single <,> container. the ']>' terminator does not constitute a valid choice at times, for it may be present withinDOCTYPE in such a way that any whitespace character sequence may separate the ] from the >. Although this is would be "bad practice" for some XML authors, it isnot something that is unlikely to not be found. A [ or a ] character found within DOCTYPE declaration, is unlikely to be an unequivocal reference point unless alot of other variables are considered, making it not the optimal choice for a "state" changing code block...


This is how this problem can be solved:

<!DOCTYPE _sect1_ PUBLIC...[...<!ENTITY...> ... > %general-entities;]>... 
_<sect1_....

Obviously the first non - comment (<!--) (also non <?...) XML semantically significant structure once DTD terminates _HAS_ (not yelling, ever, even if otherpeople may think otherwise, plain text data conveys no sound of unequivocal significance/tone/timber/chrome/emotional content :) ) to be the document _root_element; One hierarchy ends (DOCTYPE) another one initiates (root element). There is a single root element and it has that particular position. Everything elsefound there other than it (or comment only, this is omitted for simplicity but there is a functional code segment in the author's depo) means that we have a nonwell - formed document (and possibly invalid/corrupt at times).


(NOTE: the very first [ and the very last ] within a DOCTYPE  declaration are 
referred to as Declaration Subset Open (DSO) and Declaration Subset Close 
(DSC)).

Using such concepts as above, the nZyme "tokenizer" for XML is presented, as POC in C++ code, simply cut the code block below. Please avoid using it whencomments are nested within the DOCTYPE declaration or immediately after it, this is POC code; The version under development incorporates error controls, fullcomment support as well as inline parsing among other features. Care has been given to have as many useful comments as possible, for this a previous buildwithout the entirety of current working / under process features has been posted. Comments both public and private are expected for evaluation of the communityresponse. If possible, it is preferable to meet with alfs members in the IRC project channels previous appointment.


Thank you,

George Makrydakis

gmak
----

//-------------------------------------------------------------------------------------------------------
#include <iostream>
#include <fstream>
#include <string>
#include <vector>

using namespace std;


/*
 *      The nZyme (sub?) Project
 *
 *      A collection of POC code segments for an XML parser / data extractor / 
data manipulator
 *      originally created for the ALFS project. The following is a "tokenizer" 
part.
 *
 *      author  :       George Makrydakis
 *      info    :       gmakmail /at/ gmail >d0t< com
 *      license :       Pending (OSI Compatible, either GPL or BSD .. in 
progress)
 *      revision:       A2 - POC branch
 *      date    :       March 11th 2006
 *
 *
 *
 */

        int main (int argc, char **argv)
        {
        
        vector<string> xmlVECT; // vector containing separate raw line segments 
from the XML file
        vector<string> xmlITEM; // vector containing formatted string tokens as 
out of the tokenizer

        string xmlBUFF; // holds a buffered string
        string dtdROOT; // holds the root element name
        string xmlLINE; // holds a single member of the xmlVECT vector
        string dtdBUFF; // holds a buffered string while processing DTD

        int lnct;       // holds a line counter variable
        int cTAG;       // within a given string, index to a usable within code 
segments '<' character
        int oTAG;       // within a given string, index to a usable within code 
segments '>' character
        int sTAG;       // within a given string, index to a usable whitespace 
or non whitespace sequence
        int lnct2;      // holds a line counter variable




        // ---> irrelevant part for the POC STARTS 
----------------------------------------------------
        if (argc != 2)
        {
                printf("Usage: %s [VALID, WELL FORMED XML FILE]\n", argv[0]);
                return(-1);
        }
        ifstream xmlFILE(argv[1]);

        /*      note type:      WARNING
         *      note date:      Mar/10/2006
         *      note info:      File loading in memory  
         *
         *      Loading the entire file in memory is a quick and dirty solution,
         *      definetely not the optimal one, but fits its purpose in the POC
         *      code segment for the parser.
         *
         */

        if ( xmlFILE.is_open() )
        {
                while (getline(xmlFILE,xmlBUFF,'\n'))
                {
                        
xmlVECT.push_back(xmlBUFF.erase(0,xmlBUFF.find_first_not_of(" \t\n\r\v")));
                }
                xmlFILE.close();
                xmlBUFF.clear();
        }
        else
        {
                cout << "file not found!" << endl;
                return -1;
        }

        // ---> irrelevant part for the POC FINISHES 
--------------------------------------------------


        /*      Technical Preamble & Disclaimer
         *
         *      This particular code block is offered as a monolithic procedure 
- oriented programming
         *      POC code. The goal of the nZyme XML parsing (sub?)project is to 
have a token based XML
         *      parser, since "single" character manipulation issues are 
forwarded to the C++ String class
         *      of the STL C++ Library. In anytime the C++ String class can be 
substituted by one that
         *      may prove more efficient, but still providing the same exact 
functionality. This is a fact
         *      that is independent of the POC code presented in these code 
segments.
         *
         *      It is also rather obvious that a FSM model is to be applied 
around the principles laid out
         *      here, especially once the proper OOP oriented aspect of this 
project is implemented. Having
         *      a token - oriented FSM rather than a single char - oriented FSM 
can be less cumbersome to
         *      both develop and debug.
         *
         *      The entire code segment is submitted to the ALFS project 
developer group as a RFC document.
         *
         */

        for (lnct = 0; lnct < xmlVECT.size(); lnct++)
        {
                xmlLINE = xmlVECT.at(lnct);

                /*      POC CODE SEGMENT: <!DOCTYPE "pair - matching" 
compatibility.
                 *
                 *      The following section is another POC segment, designed 
especially for the DTD
                 *      portion of the XML document currently under processing, 
the entire segment is
                 *      conditioned upon the presence of the <!DOCTYPE string 
token within the given
                 *      line. Tokens begining with <! are to be having their 
own sections as the OOP
                 *      implementation of the POC presented in these code 
segments is deployed. Easily
                 *      visible is the use of the dtdROOT string, which once 
defined can serve also as
                 *      a well - formedness cue for the validation part of this 
XML parser (Both DTD
                 *      and rest of the text).
                 *
                 */

                if (xmlLINE.find("<!DOCTYPE") != string::npos)
                {
                        xmlLINE = xmlLINE.substr(xmlLINE.find("<!DOCTYPE") + 
10);
                        xmlLINE = xmlLINE.erase(0, xmlLINE.find_first_not_of(" 
\t\n\r\v"));
                        lnct2 = lnct;
                        while (dtdROOT.empty())
                        {
                                sTAG = xmlLINE.find_first_not_of(" \t\n\r\v");
                                if (sTAG != string::npos)
                                {
                                        dtdBUFF = xmlLINE.substr(sTAG, 
xmlLINE.find_first_of(" \t\n\r\v"));
                                        dtdBUFF = dtdBUFF.erase(0, 
dtdBUFF.find_first_not_of(" \t\n\r\v"));
                                        xmlLINE = xmlLINE.substr(sTAG + dtdBUFF.size(), 
xmlLINE.find_first_of(" \t\n\r\v"));
                                        if ((dtdBUFF.find_first_of("</[]&;>") 
== string::npos))
                                        {
                                                if (!(( dtdBUFF == "PUBLIC") || ( dtdBUFF 
== "SYSTEM")))
                                                {
                                                        while (xmlLINE.find("<" 
+ dtdBUFF) == string::npos)
                                                        {
                                                                /*
                                                                 *      dtdBUFF 
contains the string that is relative to the dtdROOT element
                                                                 *      simply proceed until 
the correcline containing the ("<" + dtdBUFF)
                                                                 *      string 
is met. All lines are joined to one another until that happens.
                                                                 */
                                                                xmlLINE = xmlLINE + 
"  " + xmlVECT.at(lnct2);
                                                                lnct2++;
                                                        }
                                                        lnct = lnct2;
                                                        xmlITEM.push_back("<!DOCTYPE " + 
xmlLINE.substr(0, xmlLINE.find("<" + dtdBUFF)));
                                                        xmlLINE = 
xmlLINE.substr(xmlLINE.find("<" + dtdBUFF)) + xmlVECT.at(lnct);
                                                        dtdROOT = dtdBUFF;
                                                        
                                                }
                                                else if (( dtdBUFF == "PUBLIC") || ( 
dtdBUFF == "SYSTEM"))
                                                {
                                                        // Error control is 
elementary, we are dealing with valid + well formed XML for the
                                                        // time being.
                                                        cout << "fatal error! root element 
not declared within DOCTYPE statement" << endl;
                                                        return 1;
                                                }
                                        }
                                }
                                lnct2++;
                                if (dtdROOT.empty()) {xmlLINE = 
xmlVECT.at(lnct2);}
                        }
                }

                        /*      POC CODE SEGMENT: Tokenizing according to "pair - 
matching".
                         *
                         *      The main "tokenizer" code segment is 
constrained within the following single while() loop.
                         *      The nZyme XML parsing effort is relying upon 
elaboration of significant / non significant
                         *      string tokens. The following is the section 
that is solely responsible for providing them,
                         *      according to the pair(LG)/pair(GL) binary 
identity of the formatted string token to be
                         *      provided for inline parsing.
                         *
                         */

                        while (!xmlLINE.empty())
                        {
                                if (!xmlBUFF.empty()) { xmlLINE = xmlBUFF + " " 
+ xmlLINE; }
                                xmlBUFF.clear();
                                cTAG = xmlLINE.find(">");
                                oTAG = xmlLINE.find("<");
                                if (( oTAG == string::npos ) || ( cTAG == 
string::npos ))
                                {
                                        if (( oTAG == string::npos ) && ( cTAG 
== string::npos ))
                                        {
                                                xmlITEM.push_back(xmlLINE);
                                                xmlLINE.clear();
                                                break;
                                        }
                                        else if (( oTAG != string::npos ) && ( 
cTAG == string::npos ))
                                        {
                                                        xmlBUFF = 
xmlLINE.substr(oTAG);
                                                        
xmlITEM.push_back(xmlLINE.substr(0, oTAG));
                                                        xmlLINE.clear();
                                                        break;                  
                        
                                        }
                                }
                                else
                                {       
                                        xmlITEM.push_back(xmlLINE.substr(0, 
oTAG));
                                        xmlITEM.push_back(xmlLINE.substr(oTAG, 
cTAG + 1 - oTAG));
                                        xmlLINE = xmlLINE.substr(cTAG + 1);
                                }
                        }
                }

        // for screen output, unimportant.

        for (lnct = 0; lnct < xmlITEM.size(); lnct++)
        {
                        if (xmlITEM.at(lnct) != "" ) // note: to be gotten rid 
of.
                        {
                        cout << xmlITEM.at(lnct) << endl;
                        }
        }
        xmlVECT.clear();
        xmlITEM.clear();

        return 0;
        }

//-------------------------------------------------------------------------------------------------------

        

--
http://linuxfromscratch.org/mailman/listinfo/alfs-discuss
FAQ: http://www.linuxfromscratch.org/faq/
Unsubscribe: See the above information page

RFC[01] - nZyme C++ code segment

Reply via email to