Re: Bug in Daffodil - unparsing loses quotes around a CSV field

Sloane, Brandon Sat, 16 Nov 2019 17:55:39 -0800

This looks to me like a bug in the spec.

According to the spec, generateEscapeBlock="whenNeeded" generates an escape 
block only when the data contains any of the following:


  *   any in-scope terminating delimiter
  *   dfdl:escapeBlockStart at the start of the data
  *   any dfdl:extraEscapedCharacters

In your schema, %NL; is an infix separator, which does not qualify.

There are 2 simple workarounds I see:

1) Use generateEscapeBlock="always". This is ugly, but should be technically 
correct.
2) Add %LF; to dfdl:extraEscapedCharacters

The reason we use %LF; instead of %NL; is that dfdl:extraEscapedCharacters has 
rather strict restrictions on what is allowed to be used, and %NL; is 
explicitly not permitted. To get the exact same behaviour you would expect from 
%NL; you would need to use "%LF; %CR; %NEL; %LS;" instead, but if you only care 
about UNIX and DOS style line endings, %LF; will suffice.
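
For reference, workaround 1 might look roughly like the fragment below. This is 
only a sketch: it assumes the escape scheme is otherwise the same as the 
"Quotes" scheme in the attached TDML file, that the dfdl prefix is bound as in 
that file, and that extraEscapedCharacters is left empty. I have not run this 
variant. Workaround 2 is what the attached file already uses.

```xml
<!-- Sketch of workaround 1: wrap every field in an escape block,
     whether or not the data strictly requires it. -->
<dfdl:defineEscapeScheme name="Quotes">
    <dfdl:escapeScheme escapeKind="escapeBlock"
        escapeBlockStart='"'
        escapeBlockEnd='"'
        escapeEscapeCharacter="\"
        extraEscapedCharacters=""
        generateEscapeBlock="always"/>
</dfdl:defineEscapeScheme>
```

With "always", every field would be unparsed with surrounding quotes, so the 
expected data in the unparse test would need quotes around every field too.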

Below are all of the changes I had to make. Most of these are not directly 
relevant to your question.


  *   Added xmlns:fn="http://www.w3.org/2005/xpath-functions" (I thought this 
was included automatically in TDML files?)
  *   Switched outputNewLine to UNIX style instead of DOS style. This reflects 
the fact that the test cases seem to be in UNIX style (at least on my Linux 
box; something might have translated them without us knowing)
  *   Added %LF; to extraEscapedCharacters
  *   Replaced ""Great car"" with \"Great car\"

Regards,
Brandon
________________________________
From: Costello, Roger L. <[email protected]>
Sent: Saturday, November 16, 2019 7:27 AM
To: [email protected] <[email protected]>
Subject: Bug in Daffodil - unparsing loses quotes around a CSV field


Hi Folks,



In CSV a field may span multiple lines if the field is wrapped in double quotes.



I have a CSV record that has a field that spans two lines and the field is 
wrapped in double quotes. Daffodil parses it perfectly but unparsing loses the 
double quotes. See graphic below and see attached TDML file. I believe this is 
a bug. Do you agree? Is there a workaround?  /Roger



[inline image not preserved in the archive]

<?xml version="1.0" encoding="UTF-8"?>
<tdml:testSuite
    suiteName="Bug Report csv.dfdl.xsd" 
    description="Bug in csv.dfdl.xsd"
    xmlns:tdml="http://www.ibm.com/xmlns/dfdl/testData"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xml="http://www.w3.org/XML/1998/namespace"
    xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:gpf="http://www.ibm.com/dfdl/GeneralPurposeFormat"
    xmlns:daf="urn:ogf:dfdl:2013:imp:daffodil.apache.org:2018:ext"
    xmlns:ex="http://example.com"
    xsi:schemaLocation="http://www.ibm.com/xmlns/dfdl/testData tdml.xsd"
    defaultRoundTrip="none">
    
    <!--
    This example TDML file is for a self-contained bug report.
   
    It shows the definition of an inline schema and parse test and unparse test that use that schema.
  -->
    
    <!-- 
    A DFDL schema is defined inside the tdml:defineSchema element. The contents
    are similar to a normal DFDL schema, allowing for imports, defining a
    global format via dfdl:defineFormat and dfdl:format, and defining schema
    xs:elements/groups/types/etc.
  -->
    
    <tdml:defineSchema name="CSV-Schema" elementFormDefault="unqualified">
        
        <dfdl:defineFormat name="default-dfdl-properties">
            <dfdl:format 
                alignment="1" 
                alignmentUnits="bytes"  
                binaryFloatRep="ieee" 
                binaryNumberRep="binary"  
                bitOrder="mostSignificantBitFirst"
                byteOrder="bigEndian"  
                calendarPatternKind="implicit"
                choiceLengthKind="implicit"
                documentFinalTerminatorCanBeMissing="yes" 
                emptyValueDelimiterPolicy="none"
                encoding="ISO-8859-1"
                encodingErrorPolicy="replace" 
                escapeSchemeRef=""  
                fillByte="f" 
                floating="no" 
                ignoreCase="no" 
                initiator="" 
                initiatedContent="no" 
                leadingSkip="0" 
                lengthKind="delimited"
                lengthUnits="bytes"  
                nilKind="literalValue"  
                nilValueDelimiterPolicy="none"
                occursCountKind="implicit"
                outputNewLine="%LF;"
                representation="text" 
                separator=""
                separatorPosition="infix"
                separatorSuppressionPolicy="anyEmpty"  
                sequenceKind="ordered" 
                terminator=""   
                textBidi="no" 
                textNumberCheckPolicy="strict"
                textNumberPattern="#,##0.###;-#,##0.###" 
                textNumberRep="standard" 
                textNumberRounding="explicit"  
                textNumberRoundingIncrement="0"
                textNumberRoundingMode="roundUnnecessary" 
                textOutputMinLength="0" 
                textPadKind="none" 
                textStandardBase="10"
                textStandardDecimalSeparator="."
                textStandardExponentRep="E"
                textStandardInfinityRep="Inf"  
                textStandardNaNRep="NaN"
                textStandardZeroRep="0" 
                textStandardGroupingSeparator="," 
                textTrimKind="none" 
                trailingSkip="0" 
                truncateSpecifiedLengthString="no" 
                utf16Width="fixed"   
            />
        </dfdl:defineFormat>
        <dfdl:defineEscapeScheme name="Quotes">
            <dfdl:escapeScheme escapeKind="escapeBlock" 
                escapeBlockStart='"' 
                escapeBlockEnd='"'
                escapeEscapeCharacter="\" 
                extraEscapedCharacters="%LF;"
                generateEscapeBlock="whenNeeded"/>
        </dfdl:defineEscapeScheme>
        <dfdl:defineVariable name="Separator" type="xs:string" external="true">,</dfdl:defineVariable>
        <dfdl:defineVariable name="header" type="xs:string" external="true">present</dfdl:defineVariable>
        <dfdl:defineFormat name="fieldSeparator">
            <dfdl:format separator="{ $Separator }" separatorPosition="infix"/>
        </dfdl:defineFormat>
        
        <dfdl:format ref="ex:default-dfdl-properties" />
        
        <xs:element name="csv">
            <xs:complexType>
                <xs:sequence>
                    <xs:sequence dfdl:separator="%NL;" dfdl:separatorPosition="infix">
                        <xs:element name="header" type="headerType" minOccurs="0" />
                        <xs:element name="record" type="recordType" maxOccurs="unbounded">
                            <xs:annotation>
                                <xs:appinfo source="http://www.ogf.org/dfdl/">
                                    <dfdl:assert test="{ 
                                        if ($header eq 'present')
                                        then fn:count(field) eq fn:count(../header/title)
                                        else fn:count(field) eq fn:count(../record[1]/field)
                                        }" 
                                        message="{'Each record should contain the same number of fields.'}" />
                                </xs:appinfo>
                            </xs:annotation>
                        </xs:element>
                    </xs:sequence>
                    <xs:sequence dfdl:hiddenGroupRef="hidden-optional-newline" />
                </xs:sequence>
            </xs:complexType>
        </xs:element>
        
        <xs:group name="hidden-optional-newline">    
            <xs:sequence>     
                <xs:element name="EOL" type="xs:string" minOccurs="0"
                    dfdl:initiator="%NL;"  dfdl:lengthKind="explicit" dfdl:length="0" />     
            </xs:sequence>
        </xs:group>
        
        <!-- 
      	Before we try to parse any header fields we check the 
      	$header variable and cause a discriminator to fail if 
      	its value is not "present". 
      -->
        <xs:complexType name="headerType">
            <xs:sequence dfdl:ref="fieldSeparator">
                <xs:annotation>
                    <xs:appinfo source="http://www.ogf.org/dfdl/">
                        <dfdl:discriminator test="{ $header eq 'present' }" />
                    </xs:appinfo>
                </xs:annotation>
                <xs:element name="title" maxOccurs="unbounded" type="xs:string" />
            </xs:sequence>
        </xs:complexType>
        
        <xs:complexType name="recordType">
            <xs:sequence dfdl:ref="fieldSeparator"  dfdl:separatorSuppressionPolicy="trailingEmptyStrict">
                <xs:element name="field" maxOccurs="unbounded" type="xs:string" nillable="true" dfdl:nilValue="%ES;"
                    dfdl:escapeSchemeRef="Quotes"
                    dfdl:occursCountKind="implicit">
                </xs:element>
            </xs:sequence>
        </xs:complexType>        

    </tdml:defineSchema>
    
    <!--
    Define a parse test case, using the above schema and root element. Input
    data is defined along with the expected infoset.
  -->
    
    <tdml:parserTestCase name="parse-CSV" root="csv" model="CSV-Schema"
        description="Test csv.dfdl.xsd, in the parsing direction">
        
        <tdml:document>
            <tdml:documentPart type="text"
                replaceDFDLEntities="true"><![CDATA[Year,Make,Model,Description,Price
1997,Chevy,E350,"ac, abs, moon",2999.99
1999,Chevy,Venture Extended Edition,,4900.00
1999,Chevy,Venture Extended Edition,Very Large,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00
2019,Toyota,Avalon,"He said \"Great car\" to the dealer",40000.00
1969,Ford,Mustang,"A classic
car!",1600.00]]></tdml:documentPart>
        </tdml:document>
        
        <tdml:infoset>
            <tdml:dfdlInfoset>
                <ex:csv>
                    <header>
                        <title>Year</title>
                        <title>Make</title>
                        <title>Model</title>
                        <title>Description</title>
                        <title>Price</title>
                    </header>
                    <record>
                        <field>1997</field>
                        <field>Chevy</field>
                        <field>E350</field>
                        <field>ac, abs, moon</field>
                        <field>2999.99</field>
                    </record>
                    <record>
                        <field>1999</field>
                        <field>Chevy</field>
                        <field>Venture Extended Edition</field>
                        <field xsi:nil="true"></field>
                        <field>4900.00</field>
                    </record>
                    <record>
                        <field>1999</field>
                        <field>Chevy</field>
                        <field>Venture Extended Edition</field>
                        <field>Very Large</field>
                        <field>5000.00</field>
                    </record>
                    <record>
                        <field>1996</field>
                        <field>Jeep</field>
                        <field>Grand Cherokee</field>
                        <field>MUST SELL! air, moon roof, loaded</field>
                        <field>4799.00</field>
                    </record>
                    <record>
                        <field>2019</field>
                        <field>Toyota</field>
                        <field>Avalon</field>
                        <field>He said "Great car" to the dealer</field>
                        <field>40000.00</field>
                    </record>
                    <record>
                        <field>1969</field>
                        <field>Ford</field>
                        <field>Mustang</field>
                        <field>A classic
car!</field>
                        <field>1600.00</field>
                    </record>
                </ex:csv>
            </tdml:dfdlInfoset>
        </tdml:infoset>
        
    </tdml:parserTestCase>
    
    <!--
    Define an unparse test case, using the above schema and root element. An
    input infoset is defined along with the expected unparsed data.
  -->
    
    <tdml:unparserTestCase name="unparse-CSV" root="csv" model="CSV-Schema"
        description="Test csv.dfdl.xsd, in the unparsing direction">
        
        <tdml:infoset>
            <tdml:dfdlInfoset>
                <ex:csv>
                    <header>
                        <title>Year</title>
                        <title>Make</title>
                        <title>Model</title>
                        <title>Description</title>
                        <title>Price</title>
                    </header>
                    <record>
                        <field>1997</field>
                        <field>Chevy</field>
                        <field>E350</field>
                        <field>ac, abs, moon</field>
                        <field>2999.99</field>
                    </record>
                    <record>
                        <field>1999</field>
                        <field>Chevy</field>
                        <field>Venture Extended Edition</field>
                        <field xsi:nil="true"></field>
                        <field>4900.00</field>
                    </record>
                    <record>
                        <field>1999</field>
                        <field>Chevy</field>
                        <field>Venture Extended Edition</field>
                        <field>Very Large</field>
                        <field>5000.00</field>
                    </record>
                    <record>
                        <field>1996</field>
                        <field>Jeep</field>
                        <field>Grand Cherokee</field>
                        <field>MUST SELL! air, moon roof, loaded</field>
                        <field>4799.00</field>
                    </record>
                    <record>
                        <field>2019</field>
                        <field>Toyota</field>
                        <field>Avalon</field>
                        <field>He said "Great car" to the dealer</field>
                        <field>40000.00</field>
                    </record>
                    <record>
                        <field>1969</field>
                        <field>Ford</field>
                        <field>Mustang</field>
                        <field>A classic
car!</field>
                        <field>1600.00</field>
                    </record>
                </ex:csv>
            </tdml:dfdlInfoset>
        </tdml:infoset>
        
        <tdml:document>
            <tdml:documentPart type="text"
                replaceDFDLEntities="true"><![CDATA[Year,Make,Model,Description,Price
1997,Chevy,E350,"ac, abs, moon",2999.99
1999,Chevy,Venture Extended Edition,,4900.00
1999,Chevy,Venture Extended Edition,Very Large,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00
2019,Toyota,Avalon,"He said \"Great car\" to the dealer",40000.00
1969,Ford,Mustang,"A classic
car!",1600.00]]></tdml:documentPart>
        </tdml:document>
        
    </tdml:unparserTestCase>
    
</tdml:testSuite>
