Could you put a comment about the RFC above the header field? This is probably going to end up being one of the first examples of DFDL many people look at. I don't want them to get the impression that type="xs:string" is the only type we support.

________________________________
From: Steve Lawrence <[email protected]>
Sent: Monday, December 2, 2019 8:26 AM
To: [email protected] <[email protected]>
Subject: Re: RFC-compliant DFDL schema for parsing and unparsing CSV files
Ah, I didn't realize that the header parameter was part of the RFC. That makes sense in that case.

- Steve

On 12/2/19 8:24 AM, Costello, Roger L. wrote:
> Thanks Steve! I'll make the suggested change to encoding. For the header
> variable, however, the RFC explicitly states the parameter should specify
> "present" or "absent".
>
> /Roger
>
> -----Original Message-----
> From: Steve Lawrence <[email protected]>
> Sent: Monday, December 2, 2019 8:14 AM
> To: [email protected]
> Subject: [EXT] Re: RFC-compliant DFDL schema for parsing and unparsing CSV
> files
>
> Yeah, this looks really nice and should definitely be on the DFDLSchemas
> repo, with Roger's permission.
>
> A couple minor comments:
>
> 1. The DFDL specification predefines a few variables in the dfdl namespace
> [1], one of which is "dfdl:encoding". This has the same function as your
> charset variable. It might make sense to use the predefined one instead, so
> your dfdl:format tag would look like this:
>
> <dfdl:format ref="default-dfdl-properties" encoding="{ $dfdl:encoding }" />
>
> and you wouldn't need to define an extra variable.
>
> 2. Your "header" variable is used to determine whether or not to parse the
> header element, and its values should be either "present" or "absent". Since
> this can really only have two values, it might make sense to change the
> variable name to "hasHeader" and make it a boolean with either "true" or
> "false" values.
>
> 3. DFDLSchemas isn't part of Apache, so the licensing isn't as strict, but
> this should still have a license applied to it. Obviously, I'd recommend the
> Apache License, but ultimately it's up to Roger.
>
>
> [1] https://daffodil.apache.org/docs/dfdl/#_Toc398030690
>
> On 12/1/19 11:36 AM, Sloane, Brandon wrote:
>> Can we add this to the DFDLSchemas/CSV repository?
>> --------------------------------------------------------------------------------
>> *From:* Costello, Roger L. <[email protected]>
>> *Sent:* Saturday, November 30, 2019 8:05 AM
>> *To:* [email protected] <[email protected]>
>> *Subject:* RFC-compliant DFDL schema for parsing and unparsing CSV
>> files
>>
>> Hi Folks,
>>
>> Here is my RFC-compliant DFDL schema for CSV:
>>
>> http://www.xfront.com/DFDL/DFDL-schema-for-CSV.zip
>>
>> Here is a description of my DFDL schema:
>>
>> This DFDL schema describes the CSV file format, as specified in RFC
>> 4180. The RFC says this:
>>
>> TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
>>
>> That means, to be standards-compliant, a CSV file must only contain
>> printable ASCII characters. That seemed a bit limiting so I contacted
>> the editor of the RFC, Yakov Shafranovich, and asked him about that.
>> He kindly responded and said
>> this: The original RFC was set to ASCII only but when RFC 7111 was
>> published, the media type was updated to use UTF-8. See:
>> https://lists.w3.org/Archives/Public/public-csv-wg/2014Oct/0115.html
>> https://www.iana.org/assignments/media-types/text/csv
>>
>> The first reference says: While RFC 4180 does mandate ASCII, for
>> standards purposes this has been changed and the default now is in fact
>> UTF-8.
>>
>> Okay, so CSV can contain more than just ASCII characters. Phew!
>>
>> The second reference says: The "charset" parameter specifies the
>> charset employed by the CSV content.
>>
>> Okay, so I parameterized the below DFDL schema: when you run a DFDL
>> processor on this schema, feed in a value for the charset parameter.
>> The allowable values are
>> UTF-8 or ASCII (case sensitive!).
>>
>> I just found another RFC for CSV: RFC 7111. Its introduction says
>> this: This memo updates the text/csv media type defined in RFC 4180 by
>> defining URI fragment identifiers for text/csv MIME entities.
>>
>> Hmm, I better read that document ... Okay, I read RFC 7111. It doesn't
>> modify the CSV format, except to say that a charset parameter may be
>> used to specify the charset employed by the CSV content. The RFC
>> describes how to reference portions of a CSV file using fragment
>> identifiers on a URL. That's not relevant to describing the CSV format.
>>
>> Why did I create a DFDL schema for CSV? Last month I was browsing the
>> Web and came across a web site
>> (http://www.hexacorn.com/blog/2019/09/06/state-machine-vs-regex/)
>> that said something very interesting:
>>
>>     The enlightenment came from reading
>>     the actual CSV specification. When you
>>     skim through it you quickly realize two
>>     things:
>>     1. No one ever reads stuff like this anymore
>>     2. It's unlikely that anyone covers all angles
>>        while saving files in this format
>>     The result is that we have many badly
>>     implemented CSV parsers out there. You also
>>     realize why: this format is NOT as simple
>>     as many people think. Quite the opposite,
>>     even today, after so many years, even
>>     Excel (which is actually including a lot
>>     of margin for error!) still fails to
>>     process some of these files correctly...
>>
>> After reading that I thought, "Hey, using DFDL I should be able to
>> write a parser that covers all angles of the CSV file format." And I did!
>>
>> The following DFDL schema precisely describes the CSV data format.
>> Here's a summary of what this DFDL schema expresses:
>> 1. A CSV file consists of one or more records separated by newlines.
>> 2. The last record may or may not have an ending newline.
>> 3. A record consists of one or more fields, separated by commas (or
>> some other symbol).
>> 4. Spaces are considered part of a field and may not be ignored.
>> 5. A CSV file may or may not have a header. If present, it is the
>> first line. A header consists of one or more names, separated by
>> commas. The header is separated from the records by a newline.
>> 6. A "header" parameter with value "present" means there is a header,
>> "absent" means there is no header.
>> 7. Each record should contain the same number of fields as names in
>> the header, if present. If the header is not present, then each record
>> should contain the same number of fields as the other records.
>> 8. A field may be wrapped in double quotes.
>> 9. Commas in a field that is wrapped in double quotes are ignored,
>> i.e. the commas are not to be treated as field separators.
>> 10. Newlines in a field that is wrapped in double quotes are ignored,
>> i.e. the newlines are not to be treated as record separators.
>> 11. A double quote within a field must be escaped by a double quote.
>> 12. A "charset" parameter with value "UTF-8" means the file may
>> contain any UTF-8 character, "ASCII" means the file may only contain
>> ASCII characters.
>>
>
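
For readers who want to try the twelve rules above without a DFDL processor, here is a minimal sketch in Python. It is not Roger's schema and not DFDL; it leans on Python's standard csv module, which already implements the quoting rules (8-11). The function name read_csv, the SAMPLE data, and the way the "header" and "charset" parameters are passed are assumptions made up for this illustration.

```python
import csv
import io


def read_csv(data, header="absent", charset="UTF-8"):
    """Decode and parse CSV bytes, checking rules 5-7 and 12 from the list above.

    `header` and `charset` mirror the two parameters described in the thread;
    the function itself is a hypothetical helper, not part of any schema.
    """
    if header not in ("present", "absent"):
        raise ValueError('header must be "present" or "absent"')
    if charset not in ("UTF-8", "ASCII"):  # case sensitive, as in the thread
        raise ValueError('charset must be "UTF-8" or "ASCII"')

    # Rule 12: decoding as ASCII rejects any non-ASCII byte.
    text = data.decode(charset)

    # newline="" leaves line endings untouched so the csv module can handle
    # newlines inside quoted fields itself (rule 10). Leading spaces are kept
    # by default (rule 4), and "" inside quotes becomes one quote (rule 11).
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        raise ValueError("a CSV file needs at least one record")

    names = rows[0] if header == "present" else None
    records = rows[1:] if header == "present" else rows

    # Rule 7: every record has as many fields as the header, or as the
    # first record when there is no header.
    expected = len(names) if names is not None else len(records[0])
    for number, record in enumerate(records, start=1):
        if len(record) != expected:
            raise ValueError(f"record {number} has {len(record)} fields, expected {expected}")
    return names, records


# Quoted comma, quoted newline, doubled quote, leading space, no final newline.
SAMPLE = b'name,remark\r\n"Smith, J."," said ""hi""\nbye"\r\nx, y'

if __name__ == "__main__":
    print(read_csv(SAMPLE, header="present"))
```

Calling read_csv(SAMPLE, header="present") returns the header names and the two records, with the comma, the newline, and the doubled quote preserved inside their fields. Unlike the DFDL schema, this sketch only parses; it says nothing about unparsing.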
