Re: RFC-compliant DFDL schema for parsing and unparsing CSV files

Steve Lawrence Mon, 02 Dec 2019 05:27:18 -0800

Ah, I didn't realize that the header parameter was part of the RFC. That
makes sense in that case.


- Steve

On 12/2/19 8:24 AM, Costello, Roger L. wrote:
> Thanks Steve! I'll make the suggested change to encoding. For the header 
> variable, however, the RFC explicitly states the parameter should specify 
> "present" or "absent".
> 
> /Roger
> 
> -----Original Message-----
> From: Steve Lawrence <[email protected]> 
> Sent: Monday, December 2, 2019 8:14 AM
> To: [email protected]
> Subject: [EXT] Re: RFC-compliant DFDL schema for parsing and unparsing CSV 
> files
> 
> Yeah, this looks really nice and should definitely be on the DFDLSchemas 
> repo, with Rogers permission.
> 
> A couple minor comments:
> 
> 1. The DFDL specification predefines a few variables in the dfdl namespace 
> [1], one of which is "dfdl:encoding". This has the same function as your 
> charset variable. It might make sense to use the predefined one instead, so 
> your dfdl:format tag would look like this:
> 
>   <dfdl:format ref="default-dfdl-properties" encoding="{ $dfdl:encoding }" />
> 
> and you wouldn't need to define an extra variable.
> 
> 2. Your "header" variable is used to determine whether or not to parse the 
> header element, and its values should be either "present" or "absent". Since 
> this can really only have two values, it might make sense to change the 
> variable name to "hasHeader" and make it a boolean with either "true" or 
> "false" values.
> 
> 3.  DFDLSchemas isn't part of Apache, so the licensing isn't as strict, but 
> this should still have a license applied to it. Obviously, I'd recommend the 
> Apache License, but ultimately it's up to Roger.
> 
> 
> [1] https://daffodil.apache.org/docs/dfdl/#_Toc398030690
> 
> On 12/1/19 11:36 AM, Sloane, Brandon wrote:
>> Can we add this to the DFDLSchemas/CSV repository?
>> ----------------------------------------------------------------------
>> ----------
>> *From:* Costello, Roger L. <[email protected]>
>> *Sent:* Saturday, November 30, 2019 8:05 AM
>> *To:* [email protected] <[email protected]>
>> *Subject:* RFC-compliant DFDL schema for parsing and unparsing CSV 
>> files Hi Folks,
>>
>> Here is my RFC-compliant DFDL schema for CSV:
>>
>> http://www.xfront.com/DFDL/DFDL-schema-for-CSV.zip
>>
>> Here is a description of my DFDL schema:
>>
>> This DFDL schema describes the CSV file format, as specified in RFC 
>> 4180. The RFC says this:
>>
>>      TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E
>>
>> That means, to be standards-compliant, a CSV file must only contain 
>> printable ASCII characters. That seemed a bit limiting so I contacted 
>> the editor of the RFC, Yakov Shafranovich, and asked him about that. 
>> He kindly responded and said
>> this: The original RFC was set to ASCII  only but when RFC 7111 was 
>> published, the media type was updated to use UTF-8. See:
>> https://lists.w3.org/Archives/Public/public-csv-wg/2014Oct/0115.html
>> https://www.iana.org/assignments/media-types/text/csv
>>
>> The first reference says: While RFC 4180 does mandate ASCII, for 
>> standards purposes this has been changed and the default now is in fact 
>> UTF-8.
>>
>> Okay, so CSV can contain more than just ASCII characters. Phew!
>>
>> The second reference says: The "charset" parameter specifies the 
>> charset employed by the CSV content.
>>
>> Okay, so I parameterized the below DFDL schema: when you run a DFDL 
>> processor on this schema, feed in a value for the charset parameter. 
>> The allowable values are
>> UTF-8 or ASCII (case sensitive!).
>>
>> I just found another RFC for CSV: RFC 7111. Its introduction says 
>> this: This memo updates the text/csv media type defined in RFC 4180 by 
>> defining URI fragment identifiers for text/csv MIME entities.
>>
>> Hmm, I better read that document ... Okay, I read RFC 7111. It doesn't 
>> modify the CSV format, except to say that a charset parameter may be 
>> used to specify the charset employed by the CSV content. The RFC  
>> describes how to reference portions of a CSV file using fragment 
>> identifiers on a URL. That's not relevant to describing the CSV format.
>>
>> Why did I create a DFDL schema for CSV? Last month I was browsing the 
>> Web and came across a web site
>> (http://www.hexacorn.com/blog/2019/09/06/state-machine-vs-regex/)
>> that said something very interesting:
>>
>>          The enlightenment came from reading
>>           the actual CSV specification. When you
>>           skim through it you quickly realize two
>>           things:
>>          1. No one ever reads stuff like this anymore
>>          2. It's unlikely that anyone covers all angles
>>              while saving files in this format
>>                   The result is that we have many badly
>>           implemented CSV parsers out there. You also
>>           realize why: this format is NOT as simple
>>           as many people think. Quite the opposite,
>>           even today, after so many years, even
>>           Excel (which is actually including a lot
>>           of margin for error!) still fails to
>>           process some of these files correctly...
>>
>> After reading that I thought, "Hey, using DFDL I should be able to 
>> write a parser that covers all angles of the CSV file format." And I did!
>>
>> The following DFDL schema precisely describes the CSV data format. 
>> Here's a summary of what this DFDL schema expresses:
>> 1. A CVS file consists of a one or more records separated by newlines.
>> 2. The last record may or may not have an ending newline.
>> 3. A record consists of one or more fields, separated by commas (or 
>> some other symbol).
>> 4. Spaces are considered part of a field and may not be ignored.
>> 5. A CSV file may or may not have a header. If present, it is the 
>> first line. A header consists of one or more names, separated by 
>> commas. The header is separated from the records by a newline.
>> 6. A "header" parameter with value "present" means there is a header, 
>> "absent" 
>> means there is no header.
>> 7. Each record should contain the same number of fields as names in 
>> the header, if present. If the header is not present, then each record 
>> should contain the same number of fields as the other records.
>> 8. A field may be wrapped in double quotes.
>> 9. Commas in a field that is wrapped in double quotes are ignored, 
>> i.e. the commas are not to be treated as field separators.
>> 10. Newlines in a field that is wrapped in double quotes are ignored, 
>> i.e. the newlines are not to be treated as record separators.
>> 11. A double quote within a field must be escaped by a double quote.
>> 12. A "charset" parameter with value "UTF-8" means the file may 
>> contain any
>> UTF-8 character, "ASCII" means the file may only contain ASCII characters.
>>
>

Re: RFC-compliant DFDL schema for parsing and unparsing CSV files

Reply via email to