Greetings to everybody in the *LFS projects.
PART 1: Sum of the previous bash work - for new readers of these posts.
PART 2: Current Status and Code Segments in C++ - for the seasoned reader.
PART 1: Historical Review
----------------------------
As most of you know, I have been working on a fully self - hosted alfs solution based on C++ after choosing it as the development language. The following post
is one concerning one of the most essential parts of the project, namely a fast, small and easy to debug / develop XML "parsing" class. Initially, the work was
done for providing a fully self hosted bash based solution to integrate into the jhalfs project; after talking to the jhalfs team I decided to move to a regular
programming language that would prove more adequate than a script - based solution. Thus this is not a portion for a bash based alfs solution but a C++ one.
For reviewing purposes, please consider reading the following messages:
x2sh revised:
Thu Feb 23 20:17:30 MST 2006
http://linuxfromscratch.org/pipermail/alfs-discuss/2006-February/007641.html
The above link describes a pure bash based method for extracting the significant data out of the entire book in about 1 (one) min using nothing but bash built
in commands for string manipulation (no third party binaries). This is a branch currently developed for personal use in other settings.
x2sh old "library - like" bash source
Mon Feb 20 22:14:33 MST 2006
http://linuxfromscratch.org/pipermail/alfs-discuss/2006-February/007633.html
The above link contains a character by character implementation of an XML parser (FSM - finite state machine like) as posted in the mailing list, _different_
from the one above. Very slow but functional (the revised version "parses" the book 60x times faster though...), consider it an abandoned prototype, posted
mainly for archiving purposes. Check the mailing list and the irc logs of the alfs-dev channel for discussions with official developers of the jhalfs / nalfs
project.
PART 2: Current Status and Code Segments in C++
-----------------------------------------------
When I started building this (yet another) XML "parser", the main goal was to have as few lines of code as possible, making it easier for me and others to
maintain / modify it. This parser tries to avoid as much as possible relying on a classical FSM prototype (Finite State Machine) that uses character by
character recognition of the semantically important strings. The FSM is to be used in a later setting for another use within the parsing module, but not as the
main engine for extracting significant XML syntax patterns or error controlling them. This is more of an approach and not a dogmatic view of things. Whatever is
to be built for this project is to be done using the standard C++ STL classes and the like for the sake of having it fully self hosted and as small as possible,
and as portable as possible.
Let us analyze the structure of a typical XML document, within which we find the following patterns when declaring semantically significant data of unequivocal
presence whenever and wherever met:
01.) <!ATTLIST ...>
02.) <!ELEMENT ...>
03.) <!ENTITY ...>
04.) <!NOTATION ...>
05.) <?name ... ?>
06.) <name ... />
07.) <!-- character permissible but the '--' string to occur -->
08.) <![CDATA[text]]>
09.) <![IGNORE[declarations]]>
10.) <![INCLUDE[declarations]]>
11.) <!DOCTYPE ...[declarations]>
12.) <name ... >
13.) </name>
14.) pure text with no special characters whatsoever
The following observations must be made:
OBV-01.)
Alphanumerical strings of non XML reserved characters that are delimited by a single (<,>) sequence bear significant syntactic meaning, while strings delimited
by a single (>,<) sequence do not. Single (<,<), single (>,>) when encountered outside string tokens extracted from a DTD stream, are considered to be part of
an invalid / spurious XML file.
OBV-02.)
Whitespace character sequences ( \t\r\v\n) have different separator meaning when within a pair(GL) or within a pair(LG) sequence. The first whitespace sequence
within a pair (GL) identifies an element, a preprocessing command, a declaration and so on starting immediately after the occurrence of a '<' character.
OBV-03.)
The <!DOCTYPE ... [declarations]...> where the last ... can be a whitespace character sequence (depends on the editor of the file, it is not a violation to do
so) is the only syntactically significant structure where pair(GG) and pair (LL) can be found without raising a fatal exception. Note the following excerpt from
the LFS XML source:
...
<!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [
<!ENTITY % general-entities SYSTEM "../general.ent">
%general-entities;
]>
...
This is completely equivalent to the following stream:
...
<!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [<!ENTITY % general-entities SYSTEM
"../general.ent"> %general-entities;]>
...
Note now:
<!DOCTYPE ...<!ENTITY...> ... > %general-entities;]>
Same applies for the less commonly used IGNORE and INCLUDE statements, which in any case fall within DOCTYPE and dealing with them is to be considered as
dealing with the DOCTYPE declaration (DTD is not the !DOCTYPE but rather the dtd file...). For the purposes of this document we will be limiting ourselves to
DOCTYPE, which is the hierarchically most important global container.
OBV-04.)
Of the 14 patterns that can be met in an XML source stream of characters, member 11 (DOCTYPE) and the combination of 12+13 are to be considered hierarchical
patterns, with everything else being considered as a stand - alone type.
OBV-05.)
With the notable exception of the consequences of OBV-03, we can deduct that for a given valid XML stream there is a "binary" separation that takes place
between character streams delimited by pair(LG) who require further parsing and those delimited by pair(GL) who do not. For a moment, we put aside the DOCTYPE
issue. Such pairs are found mixed with various whitespace character sequences ( \t\r\v\n) in various combinations, but we always follow the pattern below:
[FRAGMENT01]: ...<...>...<...>...<...>...
For a given line of XML source, let's limit ourselves to possible scenarios with the
first and the last delimiter characters (<),(>):
[PREFIX...](first of either < or >),...,(last of either < or >)[SUFFIX...]
--------------------------------------------------------------------------
case 1: [PREFIX...](<),...,(>)[SUFFIX...]
case 2: [PREFIX...](>),...,(>)[SUFFIX...]
case 3: [PREFIX...](<),...,(<)[SUFFIX...]
case 4: [PREFIX...](>),...,(<)[SUFFIX...]
case 5: [PREFIX...](NONE),...,(NONE)[SUFFIX...]
Provided FRAGMENT01 succession is met (no pair(GG) or pair (LL) are found), it
is easy to convert the above to successive strings like the ones below:
case 1: (<...>) -----> no prefix, no suffix, fully contains ONE of the
aforementioned patterns fully self - contained, parsing required.
case 2: !(<...>) -----> plain character data that do not need to be further
parsed, parameter substitution if necessary, but not parsing.
Dividing in this way the entire stream, you can opt whether to parse or not depending on whether there is delimitation or not, because you are working with big
chunks of strings. Depending on how this is implemented you can also have a very slim session of code doing this, without the need of continuously monitoring
state by observing the occurrence of specific reserved characters in significantly important sequences. This (5->2) conversion can be seen here, but please
check the entire code segment as posted towards the end of this post (NOTE: error control has been deliberately omited but it is present and functional in the
author's repository):
*SAMPLE C++ CODE EXCERPT*
...
while (!xmlLINE.empty())
{
if (!xmlBUFF.empty()) { xmlLINE = xmlBUFF + " " + xmlLINE; }
xmlBUFF.clear();
cTAG = xmlLINE.find(">");
oTAG = xmlLINE.find("<");
if (( oTAG == string::npos ) || ( cTAG == string::npos ))
{
if (( oTAG == string::npos ) && ( cTAG == string::npos ))
{
xmlITEM.push_back(xmlLINE);
xmlLINE.clear();
break;
}
else if (( oTAG != string::npos ) && ( cTAG == string::npos ))
{
xmlBUFF = xmlLINE.substr(oTAG);
xmlITEM.push_back(xmlLINE.substr(0, oTAG));
xmlLINE.clear();
break;
}
}
else
{
xmlITEM.push_back(xmlLINE.substr(0, oTAG));
xmlITEM.push_back(xmlLINE.substr(oTAG, cTAG + 1 - oTAG));
xmlLINE = xmlLINE.substr(cTAG + 1);
}
}
...
Notice that reserved character presence is not monitored by its presence, but rather by its absence, avoiding the need to monitor with even more elaborate if -
like control structures or switch - case constructs.
Let's get back at the DOCTYPE issue:
<!DOCTYPE ...<!ENTITY...> ... > %general-entities;]>
If we directly applied the above method here we would likely fail because of pair(GG) and pair (LL) presence. The problem can be solved if we wrap the entire
DOCTYPE declaration hierarchy within a single <,> container. the ']>' terminator does not constitute a valid choice at times, for it may be present within
DOCTYPE in such a way that any whitespace character sequence may separate the ] from the >. Although this is would be "bad practice" for some XML authors, it is
not something that is unlikely to not be found. A [ or a ] character found within DOCTYPE declaration, is unlikely to be an unequivocal reference point unless a
lot of other variables are considered, making it not the optimal choice for a "state" changing code block...
This is how this problem can be solved:
<!DOCTYPE _sect1_ PUBLIC...[...<!ENTITY...> ... > %general-entities;]>...
_<sect1_....
Obviously the first non - comment (<!--) (also non <?...) XML semantically significant structure once DTD terminates _HAS_ (not yelling, ever, even if other
people may think otherwise, plain text data conveys no sound of unequivocal significance/tone/timber/chrome/emotional content :) ) to be the document _root_
element; One hierarchy ends (DOCTYPE) another one initiates (root element). There is a single root element and it has that particular position. Everything else
found there other than it (or comment only, this is omitted for simplicity but there is a functional code segment in the author's depo) means that we have a non
well - formed document (and possibly invalid/corrupt at times).
(NOTE: the very first [ and the very last ] within a DOCTYPE declaration are
referred to as Declaration Subset Open (DSO) and Declaration Subset Close
(DSC)).
Using such concepts as above, the nZyme "tokenizer" for XML is presented, as POC in C++ code, simply cut the code block below. Please avoid using it when
comments are nested within the DOCTYPE declaration or immediately after it, this is POC code; The version under development incorporates error controls, full
comment support as well as inline parsing among other features. Care has been given to have as many useful comments as possible, for this a previous build
without the entirety of current working / under process features has been posted. Comments both public and private are expected for evaluation of the community
response. If possible, it is preferable to meet with alfs members in the IRC project channels previous appointment.
Thank you,
George Makrydakis
gmak
----
//-------------------------------------------------------------------------------------------------------
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
using namespace std;
/*
* The nZyme (sub?) Project
*
* A collection of POC code segments for an XML parser / data extractor /
data manipulator
* originally created for the ALFS project. The following is a "tokenizer"
part.
*
* author : George Makrydakis
* info : gmakmail /at/ gmail >d0t< com
* license : Pending (OSI Compatible, either GPL or BSD .. in
progress)
* revision: A2 - POC branch
* date : March 11th 2006
*
*
*
*/
int main (int argc, char **argv)
{
vector<string> xmlVECT; // vector containing separate raw line segments
from the XML file
vector<string> xmlITEM; // vector containing formatted string tokens as
out of the tokenizer
string xmlBUFF; // holds a buffered string
string dtdROOT; // holds the root element name
string xmlLINE; // holds a single member of the xmlVECT vector
string dtdBUFF; // holds a buffered string while processing DTD
int lnct; // holds a line counter variable
int cTAG; // within a given string, index to a usable within code
segments '<' character
int oTAG; // within a given string, index to a usable within code
segments '>' character
int sTAG; // within a given string, index to a usable whitespace
or non whitespace sequence
int lnct2; // holds a line counter variable
// ---> irrelevant part for the POC STARTS
----------------------------------------------------
if (argc != 2)
{
printf("Usage: %s [VALID, WELL FORMED XML FILE]\n", argv[0]);
return(-1);
}
ifstream xmlFILE(argv[1]);
/* note type: WARNING
* note date: Mar/10/2006
* note info: File loading in memory
*
* Loading the entire file in memory is a quick and dirty solution,
* definetely not the optimal one, but fits its purpose in the POC
* code segment for the parser.
*
*/
if ( xmlFILE.is_open() )
{
while (getline(xmlFILE,xmlBUFF,'\n'))
{
xmlVECT.push_back(xmlBUFF.erase(0,xmlBUFF.find_first_not_of(" \t\n\r\v")));
}
xmlFILE.close();
xmlBUFF.clear();
}
else
{
cout << "file not found!" << endl;
return -1;
}
// ---> irrelevant part for the POC FINISHES
--------------------------------------------------
/* Technical Preamble & Disclaimer
*
* This particular code block is offered as a monolithic procedure
- oriented programming
* POC code. The goal of the nZyme XML parsing (sub?)project is to
have a token based XML
* parser, since "single" character manipulation issues are
forwarded to the C++ String class
* of the STL C++ Library. In anytime the C++ String class can be
substituted by one that
* may prove more efficient, but still providing the same exact
functionality. This is a fact
* that is independent of the POC code presented in these code
segments.
*
* It is also rather obvious that a FSM model is to be applied
around the principles laid out
* here, especially once the proper OOP oriented aspect of this
project is implemented. Having
* a token - oriented FSM rather than a single char - oriented FSM
can be less cumbersome to
* both develop and debug.
*
* The entire code segment is submitted to the ALFS project
developer group as a RFC document.
*
*/
for (lnct = 0; lnct < xmlVECT.size(); lnct++)
{
xmlLINE = xmlVECT.at(lnct);
/* POC CODE SEGMENT: <!DOCTYPE "pair - matching"
compatibility.
*
* The following section is another POC segment, designed
especially for the DTD
* portion of the XML document currently under processing,
the entire segment is
* conditioned upon the presence of the <!DOCTYPE string
token within the given
* line. Tokens begining with <! are to be having their
own sections as the OOP
* implementation of the POC presented in these code
segments is deployed. Easily
* visible is the use of the dtdROOT string, which once
defined can serve also as
* a well - formedness cue for the validation part of this
XML parser (Both DTD
* and rest of the text).
*
*/
if (xmlLINE.find("<!DOCTYPE") != string::npos)
{
xmlLINE = xmlLINE.substr(xmlLINE.find("<!DOCTYPE") +
10);
xmlLINE = xmlLINE.erase(0, xmlLINE.find_first_not_of("
\t\n\r\v"));
lnct2 = lnct;
while (dtdROOT.empty())
{
sTAG = xmlLINE.find_first_not_of(" \t\n\r\v");
if (sTAG != string::npos)
{
dtdBUFF = xmlLINE.substr(sTAG,
xmlLINE.find_first_of(" \t\n\r\v"));
dtdBUFF = dtdBUFF.erase(0,
dtdBUFF.find_first_not_of(" \t\n\r\v"));
xmlLINE = xmlLINE.substr(sTAG + dtdBUFF.size(),
xmlLINE.find_first_of(" \t\n\r\v"));
if ((dtdBUFF.find_first_of("</[]&;>")
== string::npos))
{
if (!(( dtdBUFF == "PUBLIC") || ( dtdBUFF
== "SYSTEM")))
{
while (xmlLINE.find("<"
+ dtdBUFF) == string::npos)
{
/*
* dtdBUFF
contains the string that is relative to the dtdROOT element
* simply proceed until
the correcline containing the ("<" + dtdBUFF)
* string
is met. All lines are joined to one another until that happens.
*/
xmlLINE = xmlLINE +
" " + xmlVECT.at(lnct2);
lnct2++;
}
lnct = lnct2;
xmlITEM.push_back("<!DOCTYPE " +
xmlLINE.substr(0, xmlLINE.find("<" + dtdBUFF)));
xmlLINE =
xmlLINE.substr(xmlLINE.find("<" + dtdBUFF)) + xmlVECT.at(lnct);
dtdROOT = dtdBUFF;
}
else if (( dtdBUFF == "PUBLIC") || (
dtdBUFF == "SYSTEM"))
{
// Error control is
elementary, we are dealing with valid + well formed XML for the
// time being.
cout << "fatal error! root element
not declared within DOCTYPE statement" << endl;
return 1;
}
}
}
lnct2++;
if (dtdROOT.empty()) {xmlLINE =
xmlVECT.at(lnct2);}
}
}
/* POC CODE SEGMENT: Tokenizing according to "pair -
matching".
*
* The main "tokenizer" code segment is
constrained within the following single while() loop.
* The nZyme XML parsing effort is relying upon
elaboration of significant / non significant
* string tokens. The following is the section
that is solely responsible for providing them,
* according to the pair(LG)/pair(GL) binary
identity of the formatted string token to be
* provided for inline parsing.
*
*/
while (!xmlLINE.empty())
{
if (!xmlBUFF.empty()) { xmlLINE = xmlBUFF + " "
+ xmlLINE; }
xmlBUFF.clear();
cTAG = xmlLINE.find(">");
oTAG = xmlLINE.find("<");
if (( oTAG == string::npos ) || ( cTAG ==
string::npos ))
{
if (( oTAG == string::npos ) && ( cTAG
== string::npos ))
{
xmlITEM.push_back(xmlLINE);
xmlLINE.clear();
break;
}
else if (( oTAG != string::npos ) && (
cTAG == string::npos ))
{
xmlBUFF =
xmlLINE.substr(oTAG);
xmlITEM.push_back(xmlLINE.substr(0, oTAG));
xmlLINE.clear();
break;
}
}
else
{
xmlITEM.push_back(xmlLINE.substr(0,
oTAG));
xmlITEM.push_back(xmlLINE.substr(oTAG,
cTAG + 1 - oTAG));
xmlLINE = xmlLINE.substr(cTAG + 1);
}
}
}
// for screen output, unimportant.
for (lnct = 0; lnct < xmlITEM.size(); lnct++)
{
if (xmlITEM.at(lnct) != "" ) // note: to be gotten rid
of.
{
cout << xmlITEM.at(lnct) << endl;
}
}
xmlVECT.clear();
xmlITEM.clear();
return 0;
}
//-------------------------------------------------------------------------------------------------------
--
http://linuxfromscratch.org/mailman/listinfo/alfs-discuss
FAQ: http://www.linuxfromscratch.org/faq/
Unsubscribe: See the above information page