Re: RFC-compliant DFDL schema for parsing and unparsing CSV files

Steve Lawrence Mon, 02 Dec 2019 05:14:13 -0800

Yeah, this looks really nice and should definitely be on the DFDLSchemas
repo, with Rogers permission.


A couple minor comments:

1. The DFDL specification predefines a few variables in the dfdl
namespace [1], one of which is "dfdl:encoding". This has the same
function as your charset variable. It might make sense to use the
predefined one instead, so your dfdl:format tag would look like this:

  <dfdl:format ref="default-dfdl-properties" encoding="{ $dfdl:encoding
}" />

and you wouldn't need to define an extra variable.

2. Your "header" variable is used to determine whether or not to parse
the header element, and its values should be either "present" or
"absent". Since this can really only have two values, it might make
sense to change the variable name to "hasHeader" and make it a boolean
with either "true" or "false" values.

3.  DFDLSchemas isn't part of Apache, so the licensing isn't as strict,
but this should still have a license applied to it. Obviously, I'd
recommend the Apache License, but ultimately it's up to Roger.


[1] https://daffodil.apache.org/docs/dfdl/#_Toc398030690

On 12/1/19 11:36 AM, Sloane, Brandon wrote:
> Can we add this to the DFDLSchemas/CSV repository?
> --------------------------------------------------------------------------------
> *From:* Costello, Roger L. <[email protected]>
> *Sent:* Saturday, November 30, 2019 8:05 AM
> *To:* [email protected] <[email protected]>
> *Subject:* RFC-compliant DFDL schema for parsing and unparsing CSV files
> Hi Folks,
> 
> Here is my RFC-compliant DFDL schema for CSV:
> 
> http://www.xfront.com/DFDL/DFDL-schema-for-CSV.zip
> 
> Here is a description of my DFDL schema:
> 
> This DFDL schema describes the CSV file format, as specified in RFC 4180. The 
> RFC says this:
> 
>      TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E
> 
> That means, to be standards-compliant, a CSV file must only contain printable 
> ASCII characters. That seemed a bit limiting so I contacted the editor of the 
> RFC, Yakov Shafranovich, and asked him about that. He kindly responded and 
> said 
> this: The original RFC was set to ASCII  only but when RFC 7111 was 
> published, 
> the media type was updated to use UTF-8. See:
> https://lists.w3.org/Archives/Public/public-csv-wg/2014Oct/0115.html
> https://www.iana.org/assignments/media-types/text/csv
> 
> The first reference says: While RFC 4180 does mandate ASCII, for standards 
> purposes this has been changed and the default now is in fact UTF-8.
> 
> Okay, so CSV can contain more than just ASCII characters. Phew!
> 
> The second reference says: The "charset" parameter specifies the charset 
> employed by the CSV content.
> 
> Okay, so I parameterized the below DFDL schema: when you run a DFDL processor 
> on 
> this schema, feed in a value for the charset parameter. The allowable values 
> are 
> UTF-8 or ASCII (case sensitive!).
> 
> I just found another RFC for CSV: RFC 7111. Its introduction says this: This 
> memo updates the text/csv media type defined in RFC 4180 by defining URI 
> fragment identifiers for text/csv MIME entities.
> 
> Hmm, I better read that document ... Okay, I read RFC 7111. It doesn't modify 
> the CSV format, except to say that a charset parameter may be used to specify 
> the charset employed by the CSV content. The RFC  describes how to reference 
> portions of a CSV file using fragment identifiers on a URL. That's not 
> relevant 
> to describing the CSV format.
> 
> Why did I create a DFDL schema for CSV? Last month I was browsing the Web and 
> came across a web site
> (http://www.hexacorn.com/blog/2019/09/06/state-machine-vs-regex/)
> that said something very interesting:
> 
>          The enlightenment came from reading
>           the actual CSV specification. When you
>           skim through it you quickly realize two
>           things:
>          1. No one ever reads stuff like this anymore
>          2. It's unlikely that anyone covers all angles
>              while saving files in this format
>                   The result is that we have many badly
>           implemented CSV parsers out there. You also
>           realize why: this format is NOT as simple
>           as many people think. Quite the opposite,
>           even today, after so many years, even
>           Excel (which is actually including a lot
>           of margin for error!) still fails to
>           process some of these files correctly...
> 
> After reading that I thought, "Hey, using DFDL I should be able to write a 
> parser that covers all angles of the CSV file format." And I did!
> 
> The following DFDL schema precisely describes the CSV data format. Here's a 
> summary of what this DFDL schema expresses:
> 1. A CVS file consists of a one or more records separated by newlines.
> 2. The last record may or may not have an ending newline.
> 3. A record consists of one or more fields, separated by commas (or some 
> other 
> symbol).
> 4. Spaces are considered part of a field and may not be ignored.
> 5. A CSV file may or may not have a header. If present, it is the first line. 
> A 
> header consists of one or more names, separated by commas. The header is 
> separated from the records by a newline.
> 6. A "header" parameter with value "present" means there is a header, 
> "absent" 
> means there is no header.
> 7. Each record should contain the same number of fields as names in the 
> header, 
> if present. If the header is not present, then each record should contain the 
> same number of fields as the other records.
> 8. A field may be wrapped in double quotes.
> 9. Commas in a field that is wrapped in double quotes are ignored, i.e. the 
> commas are not to be treated as field separators.
> 10. Newlines in a field that is wrapped in double quotes are ignored, i.e. 
> the 
> newlines are not to be treated as record separators.
> 11. A double quote within a field must be escaped by a double quote.
> 12. A "charset" parameter with value "UTF-8" means the file may contain any 
> UTF-8 character, "ASCII" means the file may only contain ASCII characters.
>

Re: RFC-compliant DFDL schema for parsing and unparsing CSV files

Reply via email to