Thanks Steve! I'll make the suggested change to encoding. For the header variable, however, the RFC explicitly states the parameter should specify "present" or "absent".
/Roger -----Original Message----- From: Steve Lawrence <[email protected]> Sent: Monday, December 2, 2019 8:14 AM To: [email protected] Subject: [EXT] Re: RFC-compliant DFDL schema for parsing and unparsing CSV files Yeah, this looks really nice and should definitely be on the DFDLSchemas repo, with Rogers permission. A couple minor comments: 1. The DFDL specification predefines a few variables in the dfdl namespace [1], one of which is "dfdl:encoding". This has the same function as your charset variable. It might make sense to use the predefined one instead, so your dfdl:format tag would look like this: <dfdl:format ref="default-dfdl-properties" encoding="{ $dfdl:encoding }" /> and you wouldn't need to define an extra variable. 2. Your "header" variable is used to determine whether or not to parse the header element, and its values should be either "present" or "absent". Since this can really only have two values, it might make sense to change the variable name to "hasHeader" and make it a boolean with either "true" or "false" values. 3. DFDLSchemas isn't part of Apache, so the licensing isn't as strict, but this should still have a license applied to it. Obviously, I'd recommend the Apache License, but ultimately it's up to Roger. [1] https://daffodil.apache.org/docs/dfdl/#_Toc398030690 On 12/1/19 11:36 AM, Sloane, Brandon wrote: > Can we add this to the DFDLSchemas/CSV repository? > ---------------------------------------------------------------------- > ---------- > *From:* Costello, Roger L. <[email protected]> > *Sent:* Saturday, November 30, 2019 8:05 AM > *To:* [email protected] <[email protected]> > *Subject:* RFC-compliant DFDL schema for parsing and unparsing CSV > files Hi Folks, > > Here is my RFC-compliant DFDL schema for CSV: > > http://www.xfront.com/DFDL/DFDL-schema-for-CSV.zip > > Here is a description of my DFDL schema: > > This DFDL schema describes the CSV file format, as specified in RFC > 4180. The RFC says this: > > TEXTDATA = %x20-21 / %x23-2B / %x2D-7E > > That means, to be standards-compliant, a CSV file must only contain > printable ASCII characters. That seemed a bit limiting so I contacted > the editor of the RFC, Yakov Shafranovich, and asked him about that. > He kindly responded and said > this: The original RFC was set to ASCII only but when RFC 7111 was > published, the media type was updated to use UTF-8. See: > https://lists.w3.org/Archives/Public/public-csv-wg/2014Oct/0115.html > https://www.iana.org/assignments/media-types/text/csv > > The first reference says: While RFC 4180 does mandate ASCII, for > standards purposes this has been changed and the default now is in fact UTF-8. > > Okay, so CSV can contain more than just ASCII characters. Phew! > > The second reference says: The "charset" parameter specifies the > charset employed by the CSV content. > > Okay, so I parameterized the below DFDL schema: when you run a DFDL > processor on this schema, feed in a value for the charset parameter. > The allowable values are > UTF-8 or ASCII (case sensitive!). > > I just found another RFC for CSV: RFC 7111. Its introduction says > this: This memo updates the text/csv media type defined in RFC 4180 by > defining URI fragment identifiers for text/csv MIME entities. > > Hmm, I better read that document ... Okay, I read RFC 7111. It doesn't > modify the CSV format, except to say that a charset parameter may be > used to specify the charset employed by the CSV content. The RFC > describes how to reference portions of a CSV file using fragment > identifiers on a URL. That's not relevant to describing the CSV format. > > Why did I create a DFDL schema for CSV? Last month I was browsing the > Web and came across a web site > (http://www.hexacorn.com/blog/2019/09/06/state-machine-vs-regex/) > that said something very interesting: > > The enlightenment came from reading > the actual CSV specification. When you > skim through it you quickly realize two > things: > 1. No one ever reads stuff like this anymore > 2. It's unlikely that anyone covers all angles > while saving files in this format > The result is that we have many badly > implemented CSV parsers out there. You also > realize why: this format is NOT as simple > as many people think. Quite the opposite, > even today, after so many years, even > Excel (which is actually including a lot > of margin for error!) still fails to > process some of these files correctly... > > After reading that I thought, "Hey, using DFDL I should be able to > write a parser that covers all angles of the CSV file format." And I did! > > The following DFDL schema precisely describes the CSV data format. > Here's a summary of what this DFDL schema expresses: > 1. A CVS file consists of a one or more records separated by newlines. > 2. The last record may or may not have an ending newline. > 3. A record consists of one or more fields, separated by commas (or > some other symbol). > 4. Spaces are considered part of a field and may not be ignored. > 5. A CSV file may or may not have a header. If present, it is the > first line. A header consists of one or more names, separated by > commas. The header is separated from the records by a newline. > 6. A "header" parameter with value "present" means there is a header, > "absent" > means there is no header. > 7. Each record should contain the same number of fields as names in > the header, if present. If the header is not present, then each record > should contain the same number of fields as the other records. > 8. A field may be wrapped in double quotes. > 9. Commas in a field that is wrapped in double quotes are ignored, > i.e. the commas are not to be treated as field separators. > 10. Newlines in a field that is wrapped in double quotes are ignored, > i.e. the newlines are not to be treated as record separators. > 11. A double quote within a field must be escaped by a double quote. > 12. A "charset" parameter with value "UTF-8" means the file may > contain any > UTF-8 character, "ASCII" means the file may only contain ASCII characters. >
