Hello Everybody,
Some reviewing and a _NEW_ post, one step closer to a solution of "our" "problem".
1. The x2sh library snapshot I posted is completely agnostic of what it parses, but it is slow when used on a big series of documents. I have tried this
myself: parsing the entire LFS book and loading it into a semantically significant "table" simulation using bash arrays takes about 40 - 50 min. This is
because the script code is complex and requires a lot of counter variables while parsing, but most of all, it reads everything in the xml source it parses
character by character. This is completely unacceptable under _any_ circumstance. The fact remains, though, that however "weird" the syntax in the file may
be, this character-by-character nature lets it parse literally anything.
2. The genXML boosting parser script is extremely fast, taking only 30 - 40 s (a reduction in time by a factor of _60_ and up). It is able to seek within the
complete set of files for any element of the form <element> ... </element> and dump its output to a _valid_ xml file with every <element> and the data it
contains. Unfortunately it does not support attributes, so '<screen role="no dump">' "patterns" can escape it. It is bash 3.0 only, because it uses the
"perlish" =~ operator first implemented in that version. Incorporating attribute awareness and inline parsing within genXML would eventually lead to
increased complexity, especially when element attributes are laid out in a multi-line manner. This would mean more buffers, more variables, more lockups, and
we would eventually end up in a situation like (1), making no practical use of the advantage the =~ operator offers.
******************************************
It appears obvious that, by design, the best approach would be to drop character-by-character parsing (1) and make a very simple script that reads the XML
files without parsing them: for a _given_ input, the output is either the data contained within a _pair_ of < and > characters or the data outside such a
pair. "Parsing" can then be done on that specific output using the =~ operator and its kind, eliminating the need for more complex scripts. I present to you
my newest implementation, going in that direction.
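For instance, here is a minimal sketch of the kind of downstream filtering such output makes possible; the splitter name x2sh_split.sh and the <screen>
pattern are purely illustrative, not part of the attachment:

#!/bin/bash
# Assumes the splitter emits one item per line: either a complete <...> tag
# or bare character data. Keeping the regex in a variable makes it behave
# the same under bash 3.0 and later releases.
re='^<screen( |>)'
./x2sh_split.sh | while read -r item; do
    if [[ $item =~ $re ]]; then
        printf 'screen element: %s\n' "$item"
    fi
done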
*****************************************
Two versions of this new script are to be posted:
1. Everything is loaded into a conventional bash array where each entry contains data
that either lies within a <,> pair or does NOT; nothing is printed on screen (a storage sketch follows this list).
2. Everything is printed on screen while being "parsed" on the fly; nothing is stored
in arrays or anywhere else.
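As a rough illustration of the difference, the silent variant can replace every printf with an array append along these lines; the x2SHout array and the
emit helper are names of my own here, not code taken from the attachment:

declare -a x2SHout
emit() {
    # append instead of printing; this form also works under bash 2.x
    x2SHout[${#x2SHout[@]}]="$1"
}
emit "<para>"
emit "Some character data"
emit "</para>"
printf '%s\n' "${x2SHout[@]}"    # one single dump at the end, if ever wanted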
The reason for the two versions is simple: every printf builtin call takes some time. Adding an option that prints the resulting array afterwards is useless,
since the time I get for that display exceeds 4 (four) min, while displaying on the fly takes nearly 1 min 20 s (values under normal operational load). The
version printing on the fly is useful for testing (making sure that no lines are omitted during output, that counter variables are set right, etc.). So far I
think I have worked out the possible pitfalls and bugs, but you never know; either way this is a better direction than before (the silent script parses and
loads everything into the array in less than 50 s). I am also checking the bash source files to see if there is any peculiar instruction or coding style that
may be efficient for this kind of scripting.
Note that the way the case constructs are laid out makes bash treat them as && and || "commands", so we are relieved of the need for many 'if'
structures coupled with && or || operators; admittedly, this is detrimental to understanding how the script works. Also, elements and attributes are preserved
in their entirety and ready for regexp-based filtering within the array. A small stand-alone comparison follows.
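To illustrate the point with a trivial example (not code from the attachment):

line='<para>text</para>'

# a case pattern doing the work of a combined test:
case $line in
    *\<*\>*) printf '%s\n' "has a <...> pair" ;;
    *)       printf '%s\n' "no tag here" ;;
esac

# the same check spelled out with if and &&:
if [ "${line#*<}" != "$line" ] && [ "${line#*>}" != "$line" ]; then
    printf '%s\n' "has a <...> pair"
else
    printf '%s\n' "no tag here"
fi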
Also, for those who want xpointer and related stuff _NOW_ (but the core must be worked out first!), try this for now: ./<scriptname>.sh | grep "xpointer" on
the version of the script that prints its output. Xinclude and related issues can be solved easily once the entirety of the book is parsed and loaded in a
semantically and topographically meaningful manner in under 2 (two) min. This is simply a demo, please remember that and bear with me. Check the attachment
for the various versions.
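A hedged example of what that could grow into once attributes survive intact; the script name x2sh_print.sh and the exact layout of the book's xi:include
tags are assumptions on my part:

re='xpointer="([^"]*)"'
./x2sh_print.sh | grep "xpointer" | while read -r tag; do
    # BASH_REMATCH needs bash 3.0, just like =~ itself
    [[ $tag =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
done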
Take note that inline DTD elements and comments within the xml documents are considered TRASH (of no importance). Check out the previous x2sh for entity
dereferencing (it is very quick even in the character-by-character parsing edition, more so with an approach like this one).
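Should such items ever need to be dropped from the printed stream, a plain case filter is enough; a sketch only, with x2sh_print.sh again being an assumed
name:

./x2sh_print.sh | while read -r item; do
    case $item in
        \<!--*|\<!*) ;;                     # comments and inline DTD: skip
        *)           printf '%s\n' "$item" ;;
    esac
done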
Average execution times (complete LFS 6.1.1 xml source):
1. silent version: ~50 s, almost even distribution between user / sys.
2. printing version: ~1 min 30 s, distribution in favour of user vs sys.
Reducing the number of sources and filtering the input types fed to the script leads to
near-proportional decreases in execution time.
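For anyone wanting to check these figures, the bash time keyword is all that is needed; something along these lines, with paths and script names being
illustrative only:

cd /path/to/lfs-book-xml        # root of the unpacked LFS 6.1.1 xml source
time ./x2sh_silent.sh           # silent variant, ~50 s here
time ./x2sh_print.sh            # printing variant, ~1 min 30 s here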
All of this under normal operational load (web browsing, various editors and java proggies running...). Having forced "parsing" of all xml files of the book
has made it easier for me to debug some issues regarding counter variables and string manipulation that can be of use in a more "uninformed" version of the
algorithm as laid out in the script. Thank you for your patience and understanding. This script will run under both bash 2.x and 3.x versions.
MD5SUM is 6207d36085782fa45b3fb4f2115f8c67 *makeall.tar.bz2
Thank you for hosting my ideas on your mailing list. Waiting for your comments
and bug reports.
George Makrydakis
gmak
#------------------------cut---------------------------------------------
#!/bin/bash
# x2sh booster - for the x2sh component to the jhalfs project
# author: George Makrydakis > gmakmail a|t gmail d0t c0m <
# license: GPL 2.0 or up
# revision: A1-print-nocomment
# instructions: run in the LFS book root
declare -a x2SHraw
declare -a x2SHchapters=(chapter01 \
                         chapter02 \
                         chapter03 \
                         chapter04 \
                         chapter05 \
                         chapter06 \
                         chapter07 \
                         chapter08 \
                         chapter09)
declare -i x2SHindex=0
declare -i lcnt=0
declare x2SHfile
declare originalsize
declare otag
declare ctag
declare mpnt1
declare mpnt2
declare srcvar

for x2SHpart in "${x2SHchapters[@]}"
do
    cd "$x2SHpart"
    for x2SHfile in *.xml
    do
        # load the whole file into the array, one line per entry
        x2SHraw=(); lcnt=0;
        while read -r "x2SHraw[lcnt]"
        do
            ((lcnt++))
        done < "$x2SHfile"
        for ((lcnt = 0; lcnt < ${#x2SHraw[@]}; lcnt++))
        do
            case ${x2SHraw[lcnt]} in
                '')
                    ;;
                *)
                    case ${x2SHraw[lcnt]} in
                        *\<*)
                            # text preceding the first '<' is character data
                            if [ "${x2SHraw[lcnt]%%<*}" != "" ] ; then
                                printf "%s\n" "${x2SHraw[lcnt]%%<*}"
                            fi
                            ;;
                        *)
                            # no '<' at all: print the line unless it starts with '>'
                            if [ "${x2SHraw[lcnt]#>}" = "${x2SHraw[lcnt]}" ] ; then
                                printf "%s\n" "${x2SHraw[lcnt]}"
                            fi
                            ;;
                    esac
                    ;;
            esac
            mpnt1="${x2SHraw[lcnt]}"
            mpnt2="${x2SHraw[lcnt]}"
            originalsize="${#x2SHraw[lcnt]}"
            # walk every <...> pair on the line, printing each tag and the data after it
            until [ "$mpnt1" = "${x2SHraw[lcnt]##*<}" ] && \
                  [ "$mpnt2" = "${x2SHraw[lcnt]##*>}" ]
            do
                mpnt1=${mpnt1#*<}; mpnt2=${mpnt2#*>}
                otag=$((originalsize - ${#mpnt1} - 1))
                ctag=$((originalsize - ${#mpnt2} - otag))
                if [ $ctag -ge 0 ] ; then
                    printf "%s\n" "${x2SHraw[lcnt]:$otag:$ctag}"
                    srcvar="$mpnt1"; srcvar="${srcvar#*>}";
                    srcvar="${srcvar%%<*}"
                    case "$srcvar" in
                        '')
                            ;;
                        *)
                            printf "%s\n" "$srcvar"
                            srcvar=""
                            ;;
                    esac
                elif [ $ctag -lt 0 ] ; then
                    # tag spans several lines: carry the open '<' fragment onto the next entry
                    x2SHraw[$((lcnt + 1))]="<${x2SHraw[lcnt]##*<}"$'\n'"${x2SHraw[$((lcnt + 1))]}"
                    break
                fi
            done
        done
    done
    cd ..
done
#---------------------------------cut-------------------------------------------------------------------------