[Issue 8 drafts 0001649]: Field splitting is woefully under specified, and in places, simply wrong

Austin Group Bug Tracker via austin-group-l at The Open Group Thu, 07 Sep 2023 07:56:23 -0700


A NOTE has been added to this issue. 
====================================================================== 
https://www.austingroupbugs.net/view.php?id=1649 
====================================================================== 
Reported By:                kre
Assigned To:                
====================================================================== 
Project:                    Issue 8 drafts
Issue ID:                   1649
Category:                   Shell and Utilities
Type:                       Error
Severity:                   Objection
Priority:                   normal
Status:                     New
Name:                       Robert Elz 
Organization:                
User Reference:              
Section:                    XCU 2.6.5 
Page Number:                2476 
Line Number:                80478 - 80504 
Final Accepted Text:         
====================================================================== 
Date Submitted:             2023-03-31 01:55 UTC
Last Modified:              2023-09-07 14:54 UTC
====================================================================== 
Summary:                    Field splitting is woefully under specified, and in
places, simply wrong
======================================================================


---------------------------------------------------------------------- 
 (0006465) kre (reporter) - 2023-09-07 14:54
 https://www.austingroupbugs.net/view.php?id=1649#c6465 
---------------------------------------------------------------------- 
Apologies for the mess with the original version of the results. Those
reading this via the mailing list will note that I was using angle
brackets as the field delimiter characters, to show what is in each
field, and totally forgot that Mantis would interpret those, so I have
just done a quick switch to use square brackets, which works for the
note, but is much uglier to look at.

Anyway, here is the (now using []) shell strawman implementation of the
algorithm in https://www.austingroupbugs.net/view.php?id=1649#c6460 .

Again, this is truncated: most of the actual test cases are omitted,
though you can deduce what they are from the results in
https://www.austingroupbugs.net/view.php?id=1649#c6464

This test does (or did, before I fiddled just now with the "args()"
function which prints the results) produce identical results to the
version run by the shell, in
https://www.austingroupbugs.net/view.php?id=1649#c6464 - so I am not
going to include those again.

I have run this with every reasonable shell I have (not pdksh, and not
zsh, as I don't really understand its differences).   All of them (mksh
included) produce the same results here, so I believe the code is
portable enough.

# This is a dummy implementation of the proposed field splitting
# algorithm (written in sh, so hopefully sh people can follow it)
# to demonstrate that the algorithm as presented generates the
# expected output (that generated by almost every shell).

# This code knows that in the tests IFS=' ,' (space and comma)
# and rather than handling that generically, which would be possible,
# but messy, simply builds those two characters (literally) into the
# implementation (space, as an IFS white space char, and comma as an
# IFS char that is not white space).

# Similarly the code "knows" that if there is a prefix in the field
# (chars not to be treated as generated by an expansion, and hence
# exempt from splitting) that will be simply a single 'p' always, and
# similarly a suffix will be 'q' - because of that we do not need to
# have any method to indicate what part of the field is to be subject
# to field splitting

# In the following, comments that start '##' are text lifted directly
# from my proposed section 2.6.5 ("Field Splitting") text, which might
# allow readers to match this algorithm with what is described there.

# The results from this test match exactly the results from all shells
# considered to operate correctly (the same output routine is used, and
# the results compared with diff - with zero differences).

S=' '
C=','

field_split() {
        ARG=$1          # the field that needs to be split
        set --          # the set of output fields, initially empty

        # IFS is defined (IFS=' ,') and not empty, IFS white space is ' '
        # We simply know that!

        # C is our candidate field,
        # CD indicates the delimiter that terminated the candidate field
        #       ' ' indicates the delimiter was IFS white space alone
        #       ',' indicates the delimiter was a ',' (perhaps with white space)
        #       '' indicates there has been no delimiter
        C= CD=

        ## Each expansion, or substitution shall be processed in order
        ## as follows [...]

        ## While the input is not empty...
        while test -n "${ARG}"
        do
                ## Consider the first remaining character of the input.
                ## If it is:

                ## a.  A character that did not result from an unquoted
                ##     expansion or substitution:
                ## b.  A character in the input that is not a character in IFS:

                # since we know exactly what the IFS chars are, and that
                # chars that did not result from an expansion (etc) are not
                # IFS chars (our test cases ensure that) we don't need to
                # treat those two differently, just skip forward until we
                # get to an IFS char, or we run out, appending the non-IFS
                # chars to the candidate and removing them from the input.

                # here we only care about the current first char in ${ARG}
                while case "${ARG}" in
                        '')     break 2         # the end of the input, done
                                ;;
                        [\ ,]*) false           # delimiter located, exit loop
                                ;;
                        *)      TAIL=${ARG#?}   # something else
                                C=${C}${ARG%"${TAIL}"}  # appended to candidate
                                ARG=${TAIL}             # removed from input
                                ;;
                      esac
                do
                      :
                done

                # Now we are at the start of a delimiter in ARG, and the
                # candidate field is C

                # which kind of delimiter do we have?

                ## c.  An IFS white space character:

                # assume the delim will be just IFS white space (case 'c')
                CD=' '
                # and then skip any of that we find (repeating 'c' over & over)
                while case "${ARG}" in
                        ' '*)   ARG=${ARG#* };;
                        *)      false;;
                        esac
                do :; done

                ## d.  Another IFS character, not IFS white space:

                # Next if we have a non white space IFS char,
                # then it is the other kind of delimiter (case 'd' in the algo)

                case "${ARG}" in
                ,*)     CD=, ; ARG=${ARG#,}   # Remember we saw it, then remove
                        # and skip any following IFS white space
                        while case "${ARG}" in
                                ' '*)   ARG=${ARG#* };;
                                *)      false;;
                                esac
                        do :; done
                        ;;
                esac

                # now a field has been delimited so we are subject to:

                ## At this point, if the candidate is not empty, or if a
                ## non IFS white space character was seen at step d, then
                ## the candidate becomes an output field.  
                ## In either case, empty the candidate, and perform the
                ## next iteration.

                if test -n "${C}" # candidate is not empty (or...) => output
                then
                        ## if the candidate is not empty
                        ## then the candidate becomes an output field.
                        set -- "$@" "'${C}'"

                # otherwise The candidate is empty, if it was delimited
                # by only IFS white space, then candidate is dropped

                elif test "${CD}" != ' '
                then
                        ## or if a non IFS white space character was seen
                        ## then the candidate becomes an output field.
                        set -- "$@" "''"        # no need for $C, it is ""
                fi

                ## In either case, empty the candidate, and perform
                ## the next iteration.

                CD=
                C=
        done

        ## When the input is empty, if the candidate is not empty, it
        ## becomes an output field.

        if test -n "${C}"
        then
                # not an empty field after last delim, so it is included
                set -- "$@" "'${C}'"
        fi

        # return the split field, as a list of quoted words (to become fields)
        printf %s "$*"
}

args()
{
        name=$1; shift

        printf '%s:\t%d:\t' "$name" "$#"
        printf '[%s]' "$@"
        printf '\n'
}

tst()
{
        N=$1

        eval set -- $(field_split "$2")

        args "$N" "$@"
}

W='abc'
SW=' abc'
WS='abc '
SWS=' abc '
CW=',abc'
WC='abc,'
CWC=',abc,'
WSW='abc def'
WSSW='abc  def'
# and many more definitions like that

# followed by the actual test invocations

tst W "$W"
tst SW "$SW"
tst WS "$WS"
tst SWS "$SWS"
tst CW "$CW"
tst WC "$WC"
tst CWC "$CWC"
tst WSW "$WSW"
tst WSSW "$WSSW"
tst WCW "$WCW"
tst WCCW "$WCCW"
tst WSCW "$WSCW"
tst WCSW "$WCSW"
tst WSCSW "$WSCSW"
tst WSCSCSW "$WSCSCSW"

# and many more. 
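
For anyone wanting to repeat the comparison with their own shell, here
is a minimal sketch of the other half of the test: it hands the same
kind of input to the shell's own field splitting and prints the result
in the same layout as args() above. It is not part of the original
note. The names show, native_split and native_split_pq are made up for
this illustration, and the literal inputs are only guesses at what the
omitted definitions look like. IFS=' ,' and the 'p' / 'q' prefix and
suffix are taken from the comments in the script.

```shell
# Hypothetical cross-check, not from the original note.

show() {                # same output layout as args() in the note
        name=$1; shift
        printf '%s:\t%d:\t' "$name" "$#"
        printf '[%s]' "$@"
        printf '\n'
}

# The function bodies are subshells, so IFS and set -f do not leak out.
native_split() (
        name=$1 value=$2
        IFS=' ,'        # the IFS the script says its tests assume
        set -f          # keep pathname expansion out of the picture
        set -- $value   # unquoted on purpose: this is the split under test
        show "$name" "$@"
)

# Same again, with a literal prefix 'p' and suffix 'q' that did not come
# from an expansion and so are not themselves subject to splitting.
native_split_pq() (
        name=$1 value=$2
        IFS=' ,'
        set -f
        set -- p${value}q
        show "p${name}q" "$@"
)

native_split CWC ',abc,'        # 2 fields: [][abc]
native_split WC 'abc,'          # 1 field:  [abc]
native_split WCCW 'abc,,def'    # 3 fields: [abc][][def]
native_split WSCSW 'abc , def'  # 2 fields: [abc][def]
native_split_pq SWS ' abc '     # 3 fields: [p][abc][q]
```

Saving this output and the output of the tst calls above to two files
and running diff on them is the comparison described in the comments at
the top of the script.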

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2023-03-31 01:55 kre            New Issue                                    
2023-03-31 01:55 kre            File Added: ifs                              
2023-03-31 01:55 kre            Name                      => Robert Elz      
2023-03-31 01:55 kre            Section                   => XCU 2.6.5       
2023-03-31 01:55 kre            Page Number               => 2476            
2023-03-31 01:55 kre            Line Number               => 80478 - 80504   
2023-07-31 16:13 Don Cragun     Note Added: 0006412                          
2023-09-07 14:14 kre            Note Added: 0006459                          
2023-09-07 14:15 kre            Note Added: 0006460                          
2023-09-07 14:30 kre            Note Added: 0006462                          
2023-09-07 14:32 kre            Note Added: 0006463                          
2023-09-07 14:41 kre            Note Deleted: 0006463                        
2023-09-07 14:43 kre            Note Edited: 0006462                         
2023-09-07 14:45 kre            Note Added: 0006464                          
2023-09-07 14:54 kre            Note Added: 0006465                          
======================================================================
