[sword-devel] Improvements to osis2mod to handle XML comments and
<header> correctly

John Zaitseff Thu, 04 Feb 2010 19:36:27 -0800

Dear SWORD developers,

Firstly, thanks for developing the SWORD library!  I have been using
this library, in conjunction with the BibleTime front-end, for many
years.


I have recently started to develop some OSIS documents of my own.
In doing so, I found that the XML parser in osis2mod is somewhat
fragile---something that you are, no doubt, aware of.

In particular, osis2mod does not handle XML comments at all, nor
does it correctly parse the <header> element.  Being able to handle
XML comments is, I think, quite important.  For example, I like to
document the SVN revision ID in an XML comment.  I also like to be
able to comment out sections of the XML file when testing the
osis2mod parser.

Furthermore, the osis2mod XML parser looks for the first <div> in
the document, no matter where that occurs.  In particular, if the
OSIS document includes a <revisionDesc> tag in the header, it will
have <p> tags as well---which will be translated by transformBSP()
into <div> tags---and get used as the starting point for the
document!
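
To illustrate the problem, here is a minimal sketch.  It is not the
osis2mod code: Tag and firstDivIndex() are hypothetical names, and the
tag list stands in for what the parser sees after transformBSP() has
turned the header's <p> into a <div>.  With requireHeaderEnd set to
false the scan stops at the <div> inside the header; with it set to
true it waits for the </header> end tag first.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a parsed tag.
struct Tag {
    std::string name;
    bool isEndTag;
};

// Returns the index of the first <div> start tag, or -1 if none.
// If requireHeaderEnd is true, <div> tags seen before the </header>
// end tag are ignored.
int firstDivIndex(const std::vector<Tag>& tags, bool requireHeaderEnd) {
    bool headerEnded = !requireHeaderEnd;
    for (std::size_t i = 0; i < tags.size(); ++i) {
        const Tag& t = tags[i];
        if (t.isEndTag) {
            if (t.name == "header") {
                headerEnded = true;
            }
        } else if (headerEnded && t.name == "div") {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Usage: a header whose <revisionDesc> contains a (transformed) <div>.
void demoFirstDiv() {
    std::vector<Tag> tags = {
        {"osis", false}, {"header", false}, {"revisionDesc", false},
        {"div", false}, {"div", true}, {"revisionDesc", true},
        {"header", true}, {"div", false}
    };
    assert(firstDivIndex(tags, false) == 3);  // picks the header's <div>
    assert(firstDivIndex(tags, true) == 7);   // picks the real first <div>
}
```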

For this reason, I have written a quick patch that solves these
particular problems.  Could you please apply it to the SVN head for
utilities/osis2mod.cpp?  Comments are handled similarly to spaces:
they are skipped.  And handleToken() now looks for the first <div>
after the </header> end tag.
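
The comment skipping can be shown in isolation with a small
stand-alone sketch.  This is not the patch itself: stripXmlComments()
is a hypothetical helper that works on a whole string rather than on
the character stream read by processOSIS(), and it uses fewer states
than the patch does.  Unlike the patch, it also accepts a closing
"--->", and it silently drops an unterminated comment.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns a copy of `in` with every
// "<!-- ... -->" span removed.  An unterminated comment swallows
// the rest of the input.
std::string stripXmlComments(const std::string& in) {
    enum State { TEXT, IN_COMMENT, ONE_HYPHEN, TWO_HYPHENS };
    State state = TEXT;
    std::string out;

    for (std::size_t i = 0; i < in.size(); ++i) {
        const char c = in[i];
        switch (state) {
        case TEXT:
            if (in.compare(i, 4, "<!--") == 0) {
                state = IN_COMMENT;
                i += 3;  // skip the rest of the "<!--" opener
            } else {
                out += c;
            }
            break;
        case IN_COMMENT:
            if (c == '-') {
                state = ONE_HYPHEN;
            }
            break;
        case ONE_HYPHEN:
            state = (c == '-') ? TWO_HYPHENS : IN_COMMENT;
            break;
        case TWO_HYPHENS:
            if (c == '>') {
                state = TEXT;        // "-->" seen: comment is over
            } else if (c != '-') {
                state = IN_COMMENT;  // "--" was not a terminator
            }
            // another '-' leaves us in TWO_HYPHENS
            break;
        }
    }
    return out;
}

// Usage: comments vanish, everything else is copied through.
void demoStripComments() {
    assert(stripXmlComments("<a><!-- $Id$ --></a>") == "<a></a>");
    assert(stripXmlComments("<!x>") == "<!x>");
}
```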

In general, I think that (perhaps eventually) the proper way to
parse XML is to use a library like libxml---which is designed
specifically for this purpose.

Yours truly,

John Zaitseff

-- 
John Zaitseff                    ,--_|\    The ZAP Group
Phone:  +61 2 9643 7737         /      \   Sydney, Australia
E-mail: j.zaits...@zap.org.au   \_,--._*   http://www.zap.org.au/
                                      v

Index: utilities/osis2mod.cpp
===================================================================
--- utilities/osis2mod.cpp	(revision 2505)
+++ utilities/osis2mod.cpp	(working copy)
@@ -595,6 +595,7 @@
 
 	// Flag used to indicate where useful text begins
 	static bool               firstDiv        = false;
+	static bool               headerEnded     = false;
 
 	// Retain the sID of book, chapter and verse (commentary) divs so that we can find them again.
 	// This relies on transformBSP.
@@ -635,9 +636,9 @@
 			}
 		}
 
-		// throw away everything up to the first div
+		// throw away everything up to the first div (that is outside the header)
 		if (!firstDiv) {
-			if (tokenName == "div") {
+			if (headerEnded && (tokenName == "div")) {
 				if (debug & DEBUG_OTHER) {
 					cout << "DEBUG(FOUND): Found first div and pitching prior material: " << text << endl;
 				}
@@ -954,8 +955,16 @@
 			}
 		}
 
-		// We haven't seen the first div so there is nothing to do.
+		// We haven't seen the first div outside the header so there is little to do.
 		if (!firstDiv) {
+			if (tokenName == "header") {
+				headerEnded = true;
+
+				if (debug & DEBUG_OTHER) {
+					cout << "DEBUG(FOUND): End of header found" << endl;
+				}
+			}
+
 			// Collect the content so it can be used to suggest the module's conf.
 			return false;
 		}
@@ -1306,6 +1315,16 @@
 }
 
 void processOSIS(istream& infile) {
+	typedef enum {
+		CS_NOT_IN_COMMENT,		// or seen starting "<"
+		CS_SEEN_STARTING_EXCLAMATION,
+		CS_SEEN_STARTING_HYPHEN,
+		CS_IN_COMMENT,
+		CS_SEEN_ENDING_HYPHEN,
+		CS_SEEN_SECOND_ENDING_HYPHEN,
+		CS_SEEN_ENDING_GREATER_THAN
+	} t_commentstate;
+
 	activeOsisID[0] = '\0';
 
 	strcpy(currentOsisID,"N/A");
@@ -1320,13 +1339,15 @@
 
 	SWBuf token;
 	SWBuf text;
+	bool incomment = false;
+	t_commentstate commentstate = CS_NOT_IN_COMMENT;
 	bool intoken = false;
 	bool inWhitespace = false;
 	bool seeingSpace = false;
 	unsigned char curChar = '\0';
 
 	while (infile.good()) {
-		
+
 		curChar = infile.get();
 
 		// skip the character if it is bad. infile.good() will catch the problem
@@ -1340,6 +1361,95 @@
 			continue;
 		}
 
+		// Handle XML comments starting with "<!--", ending with "-->"
+
+		if (intoken && !incomment) {
+			switch (commentstate) {
+				case CS_NOT_IN_COMMENT :
+					if (curChar == '!') {
+						commentstate = CS_SEEN_STARTING_EXCLAMATION;
+						token.append(curChar);
+						continue;
+					} else {
+						break;
+					}
+
+				case CS_SEEN_STARTING_EXCLAMATION :
+					if (curChar == '-') {
+						commentstate = CS_SEEN_STARTING_HYPHEN;
+						token.append(curChar);
+						continue;
+					} else {
+						commentstate = CS_NOT_IN_COMMENT;
+						break;
+					}
+
+				case CS_SEEN_STARTING_HYPHEN :
+					if (curChar == '-') {
+						incomment = true;
+						commentstate = CS_IN_COMMENT;
+						token.append(curChar);
+
+						if (debug & DEBUG_OTHER) {
+							cout << "DEBUG(COMMENTS): in comment" << endl;
+						}
+
+						continue;
+					} else {
+						commentstate = CS_NOT_IN_COMMENT;
+						break;
+					}
+
+				default:
+					cout << "FATAL(COMMENTS): unknown commentstate on comment start: " << commentstate << endl;
+					exit(EXIT_BAD_NESTING);
+			}
+		}
+
+		if (incomment) {
+			switch (commentstate) {
+				case CS_IN_COMMENT:
+					if (curChar == '-') {
+						commentstate = CS_SEEN_ENDING_HYPHEN;
+						continue;
+					} else {
+						// ignore the character
+						continue;
+					}
+
+				case CS_SEEN_ENDING_HYPHEN :
+					if (curChar == '-') {
+						commentstate = CS_SEEN_SECOND_ENDING_HYPHEN;
+						continue;
+					} else {
+						// ignore character
+						commentstate = CS_IN_COMMENT;
+						continue;
+					}
+
+				case CS_SEEN_SECOND_ENDING_HYPHEN :
+					if (curChar == '>') {
+						intoken = false;
+						incomment = false;
+						commentstate = CS_NOT_IN_COMMENT;
+
+						if (debug & DEBUG_OTHER) {
+							cout << "DEBUG(COMMENTS): out of comment" << endl;
+						}
+
+						continue;
+					} else {
+						// ignore character
+						commentstate = CS_IN_COMMENT;
+						continue;
+					}
+
+				default:
+					cout << "FATAL(COMMENTS): unknown commentstate on comment end: " << commentstate << endl;
+					exit(EXIT_BAD_NESTING);
+			}
+		}
+
 		// Outside of tokens merge adjacent whitespace
 		if (!intoken) {
 			seeingSpace = isspace(curChar);
@@ -1358,13 +1468,16 @@
 			inWhitespace = false;
 			token.append('>');
 			// take this isalpha if out to check for bugs in text
-			if ((isalpha(token[1])) || (isalpha(token[2]))) {
+			if (isalpha(token[1]) ||
+			    (((token[1] == '/') || (token[1] == '?')) && isalpha(token[2]))) {
 				//cout << "Handle:" << token.c_str() << endl;
 				XMLTag t = transformBSP(token.c_str());
 
 				if (!handleToken(text, t)) {
 					text.append(t);
 				}
+			} else {
+				cout << "WARNING(PARSE): malformed token: " << token << endl;
 			}
 			continue;
 		}

_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
