Hello Everybody,
Some reviewing and a _NEW_ post, one step closer to a solution of "our" "problem".
1. The x2sh library snapshot I posted is completely agnostic of what it parses, but it is slow when used on a big series of documents. I have tried this
myself: parsing the entire LFS book and loading it into a semantically significant "table" simulation using bash arrays takes about 40 - 50 min. This is
because the script code is complex and requires a lot of counter variables while parsing, but most of all, it reads everything in the xml source it parses
character by character. This is completely unacceptable under _any_ circumstance. The fact remains, though, that however "weird" the syntax in the file may
be, this character-by-character nature lets it parse literally anything.
2. The genXML boosting parser script is extremely fast, taking only 30 - 40 s (a reduction in time by a factor of _60_ and up). It is able to seek within the
complete set of files for any element of the form <element> ... </element> and dump its output to a _valid_ xml file with every <element> and the data it
contains. Unfortunately it does not support attributes, so '<screen role="no dump">' "patterns" can escape it. It is bash 3.0 only, because it uses the
"perlish" =~ operator first implemented in that version. Incorporating attribute awareness and inline parsing within genXML would eventually lead to
increased complexity, especially when element attributes are laid out in a multi-line manner. This would mean more buffers, more variables, more lockups, and
we would eventually end up in a situation like (1), making no practical use of the advantage the =~ operator offers.
******************************************
It appears obvious that, by design, the best approach would be to drop character-by-character parsing (1) and make a very simple script that reads the XML
files without parsing them: for a _given_ input, the output is either the data contained within a _pair_ of < and > characters or the data outside such a
pair. "Parsing" can then be done on that specific output using the =~ operator and its kind, eliminating the need for more complex scripts. I present to you
my newest implementation, going in that direction.
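For instance, here is a minimal sketch of the kind of downstream filtering such output makes possible; the splitter name x2sh_split.sh and the <screen>
pattern are purely illustrative, not part of the attachment:

#!/bin/bash
# Assumes the splitter emits one item per line: either a complete <...> tag
# or bare character data. Keeping the regex in a variable makes it behave
# the same under bash 3.0 and later releases.
re='^<screen( |>)'
./x2sh_split.sh | while read -r item; do
    if [[ $item =~ $re ]]; then
        printf 'screen element: %s\n' "$item"
    fi
done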
*****************************************
Two versions of this new script are to be posted:
1. Everything is loaded into a conventional bash array where each entry contains data
that either lies within a <,> pair or does NOT; nothing is printed on screen (a storage sketch follows this list).
2. Everything is printed on screen while being "parsed" on the fly; nothing is stored
in arrays or anywhere else.
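As a rough illustration of the difference, the silent variant can replace every printf with an array append along these lines; the x2SHout array and the
emit helper are names of my own here, not code taken from the attachment:

declare -a x2SHout
emit() {
    # append instead of printing; this form also works under bash 2.x
    x2SHout[${#x2SHout[@]}]="$1"
}
emit "<para>"
emit "Some character data"
emit "</para>"
printf '%s\n' "${x2SHout[@]}"    # one single dump at the end, if ever wanted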
The reason for the two versions is simple: every printf builtin call takes some time. Adding an option that prints the resulting array afterwards is useless,
since the time I get for that display exceeds 4 (four) min, while displaying on the fly takes nearly 1 min 20 s (values under normal operational load). The
version printing on the fly is useful for testing (making sure that no lines are omitted during output, that counter variables are set right, etc.). So far I
think I have worked out the possible pitfalls and bugs, but you never know; either way this is a better direction than before (the silent script parses and
loads everything into the array in less than 50 s). I am also checking the bash source files to see if there is any peculiar instruction or coding style that
may be efficient for this kind of scripting.
Note that the way the case constructs are laid out makes bash treat them as && and || "commands", so we are relieved of the need for many 'if'
structures coupled with && or || operators; admittedly, this is detrimental to understanding how the script works. Also, elements and attributes are preserved
in their entirety and ready for regexp-based filtering within the array. A small stand-alone comparison follows.
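To illustrate the point with a trivial example (not code from the attachment):

line='<para>text</para>'

# a case pattern doing the work of a combined test:
case $line in
    *\<*\>*) printf '%s\n' "has a <...> pair" ;;
    *)       printf '%s\n' "no tag here" ;;
esac

# the same check spelled out with if and &&:
if [ "${line#*<}" != "$line" ] && [ "${line#*>}" != "$line" ]; then
    printf '%s\n' "has a <...> pair"
else
    printf '%s\n' "no tag here"
fi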
Also, for those who want xpointer and related stuff _NOW_ (but the core must be worked out first!), try this for now: ./<scriptname>.sh | grep "xpointer" on
the version of the script that prints its output. Xinclude and related issues can be solved easily once the entirety of the book is parsed and loaded in a
semantically and topographically meaningful manner in under 2 (two) min. This is simply a demo, please remember that and bear with me. Check the attachment
for the various versions.
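A hedged example of what that could grow into once attributes survive intact; the script name x2sh_print.sh and the exact layout of the book's xi:include
tags are assumptions on my part:

re='xpointer="([^"]*)"'
./x2sh_print.sh | grep "xpointer" | while read -r tag; do
    # BASH_REMATCH needs bash 3.0, just like =~ itself
    [[ $tag =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
done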
Take note that inline DTD elements and comments within the xml documents are considered TRASH (of no importance). Check out the previous x2sh for entity
dereferencing (it is very quick even in the character-by-character parsing edition, more so with an approach like this one).
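Should such items ever need to be dropped from the printed stream, a plain case filter is enough; a sketch only, with x2sh_print.sh again being an assumed
name:

./x2sh_print.sh | while read -r item; do
    case $item in
        \<!--*|\<!*) ;;                     # comments and inline DTD: skip
        *)           printf '%s\n' "$item" ;;
    esac
done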
Average execution times (complete LFS 6.1.1 xml source):
1. silent version: ~50 s, almost even distribution between user / sys.
2. printing version: ~1 min 30 s, distribution in favour of user vs sys.
Reducing the number of sources and filtering the input types fed to the script leads to
near-proportional decreases in execution time.
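For anyone wanting to check these figures, the bash time keyword is all that is needed; something along these lines, with paths and script names being
illustrative only:

cd /path/to/lfs-book-xml        # root of the unpacked LFS 6.1.1 xml source
time ./x2sh_silent.sh           # silent variant, ~50 s here
time ./x2sh_print.sh            # printing variant, ~1 min 30 s here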
All of this under normal operational load (web browsing, various editors and java proggies running...). Having forced "parsing" of all xml files of the book
has made it easier for me to debug some issues regarding counter variables and string manipulation that can be of use in a more "uninformed" version of the
algorithm as laid out in the script. Thank you for your patience and understanding. This script will run under both bash 2.x and 3.x versions.
MD5SUM is 6207d36085782fa45b3fb4f2115f8c67 *makeall.tar.bz2
Thank you for hosting my ideas on your mailing list. Waiting for your comments
and bug reports.
George Makrydakis
gmak
#------------------------cut---------------------------------------------
#!/bin/bash
# x2sh booster - for the x2sh component to the jhalfs project
# author: George Makrydakis > gmakmail a|t gmail d0t c0m <
# license: GPL 2.0 or up
# revision: A1-print-nocomment
# instructions: run in the LFS book root
declare -a x2SHraw
declare -a x2SHchapters=(chapter01 \
                         chapter02 \
                         chapter03 \
                         chapter04 \
                         chapter05 \
                         chapter06 \
                         chapter07 \
                         chapter08 \
                         chapter09)
declare -i x2SHindex=0
declare -i lcnt=0
declare x2SHfile
declare originalsize
declare otag
declare ctag
declare mpnt1
declare mpnt2
declare srcvar

for x2SHpart in "${x2SHchapters[@]}"
do
    cd "$x2SHpart"
    for x2SHfile in *.xml
    do
        # load the whole file into the array, one line per entry
        x2SHraw=(); lcnt=0;
        while read -r "x2SHraw[lcnt]"
        do
            ((lcnt++))
        done < "$x2SHfile"
        for ((lcnt = 0; lcnt < ${#x2SHraw[@]}; lcnt++))
        do
            case ${x2SHraw[lcnt]} in
                '')
                    ;;
                *)
                    case ${x2SHraw[lcnt]} in
                        *\<*)
                            # text preceding the first '<' is character data
                            if [ "${x2SHraw[lcnt]%%<*}" != "" ] ; then
                                printf "%s\n" "${x2SHraw[lcnt]%%<*}"
                            fi
                            ;;
                        *)
                            # no '<' at all: print the line unless it starts with '>'
                            if [ "${x2SHraw[lcnt]#>}" = "${x2SHraw[lcnt]}" ] ; then
                                printf "%s\n" "${x2SHraw[lcnt]}"
                            fi
                            ;;
                    esac
                    ;;
            esac
            mpnt1="${x2SHraw[lcnt]}"
            mpnt2="${x2SHraw[lcnt]}"
            originalsize="${#x2SHraw[lcnt]}"
            # walk every <...> pair on the line, printing each tag and the data after it
            until [ "$mpnt1" = "${x2SHraw[lcnt]##*<}" ] && \
                  [ "$mpnt2" = "${x2SHraw[lcnt]##*>}" ]
            do
                mpnt1=${mpnt1#*<}; mpnt2=${mpnt2#*>}
                otag=$((originalsize - ${#mpnt1} - 1))
                ctag=$((originalsize - ${#mpnt2} - otag))
                if [ $ctag -ge 0 ] ; then
                    printf "%s\n" "${x2SHraw[lcnt]:$otag:$ctag}"
                    srcvar="$mpnt1"; srcvar="${srcvar#*>}";
                    srcvar="${srcvar%%<*}"
                    case "$srcvar" in
                        '')
                            ;;
                        *)
                            printf "%s\n" "$srcvar"
                            srcvar=""
                            ;;
                    esac
                elif [ $ctag -lt 0 ] ; then
                    # tag spans several lines: carry the open '<' fragment onto the next entry
                    x2SHraw[$((lcnt + 1))]="<${x2SHraw[lcnt]##*<}"$'\n'"${x2SHraw[$((lcnt + 1))]}"
                    break
                fi
            done
        done
    done
    cd ..
done
#---------------------------------cut-------------------------------------------------------------------------