Thank you very much. I converted your benchmark to my Smalltalk dialect and was pleased with the results. It gave me the impetus I needed to implement the equivalent of NeoCSVReader's #recordClass: feature, although in my case it requires the class to implement #withAll:, and the operand is a (reused) OrderedCollection.
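For anyone wanting to try the same hook: the only requirement on the record class is a class-side #withAll: that copies the field values out before the (reused) collection is recycled. A minimal sketch, with an illustrative record class (the names are not part of any actual API):

    Object subclass: #PaymentRecord
        instanceVariableNames: 'date amount reference'
        classVariableNames: ''
        package: 'CSV-Demo'

    PaymentRecord class >> withAll: fields
        "fields is the reader's reused OrderedCollection,
        so copy the values out rather than keeping the argument"
        ^ self new
            setDate: (fields at: 1)
            amount: (fields at: 2)
            reference: (fields at: 3)

    PaymentRecord >> setDate: aDate amount: anAmount reference: aReference
        date := aDate.
        amount := anAmount.
        reference := aReference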
There's one difference between CSVEncoder and NeoCSVWriter that might be of interest: you can't tell CSVEncoder whether a field is #raw or #quoted, because it always figures that out for itself. I was prepared to pay an efficiency penalty to make sure I did not get this wrong, and am pleased to find it wasn't as much of a penalty as I feared.

On Wed, 6 Jan 2021 at 22:52, Sven Van Caekenberghe <s...@stfx.eu> wrote:
> Hi Richard,
>
> Benchmarking is a can of worms; many factors have to be considered. But the
> first requirement is obviously to be completely open about what you are
> doing and what you are comparing.
>
> NeoCSV contains a simple benchmark suite called NeoCSVBenchmark, which was
> used during development. Note that it is a bit tricky to use: you need to
> run a write benchmark with a specific configuration before you can try the
> read benchmarks.
>
> The core data is a 100,000-line file (2.5 MB) like this:
>
> 1,-1,99999
> 2,-2,99998
> 3,-3,99997
> 4,-4,99996
> 5,-5,99995
> 6,-6,99994
> 7,-7,99993
> 8,-8,99992
> 9,-9,99991
> 10,-10,99990
> ...
>
> That parses in ~250ms on my machine.
>
> NeoCSV has quite a few features and handles various edge cases. Obviously,
> a minimal, custom implementation could be faster.
>
> NeoCSV is called efficient not just because it is reasonably fast, but
> because it can be configured to generate domain objects without
> intermediate structures, and because it can convert individual fields
> (parse numbers, dates, times, ...) while parsing.
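> Configured for the integer triples above, a minimal sketch of that looks
> like this (by default each record comes back as an Array; each
> addIntegerField converts its column while parsing):
>
>     | input reader |
>     input := ('1,-1,99999' , String cr , '2,-2,99998' , String cr , '3,-3,99997') readStream.
>     reader := NeoCSVReader on: input.
>     reader
>         addIntegerField;
>         addIntegerField;
>         addIntegerField.
>     reader upToEnd.
>     "a collection of three records, e.g. #(1 -1 99999), with Integer fields"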
> Like you said, some generated CSV output out in the wild is very
> irregular. I try to stick with standard CSV as much as possible.
>
> Sven
>
> > On 6 Jan 2021, at 05:10, Richard O'Keefe <rao...@gmail.com> wrote:
> >
> > NeoCSVReader is described as efficient. What is that in comparison to?
> > What benchmark data are used? Here are benchmark results measured today.
> >
> > (5,000 data line file, 9,145,009 characters)
> >
> > method                  time (ms)
> > Just read characters          410
> > CSVDecoder>>next             3415   astc's CSV reader (defaults)    1.26 x CSVParser
> > NeoCSVReader>>next           4798   NeoCSVReader (default state)    1.78 x CSVParser
> > CSVParser>>next              2701   pared-to-the-bone CSV reader    1.00 reference
> >
> > (10,000 data line file, 1,544,836 characters)
> >
> > method                  time (ms)
> > Just read characters           93
> > CSVDecoder>>next              530   astc's CSV reader (defaults)    1.26 x CSVParser
> > NeoCSVReader>>next            737   NeoCSVReader (default state)    1.75 x CSVParser
> > CSVParser>>next               421   pared-to-the-bone CSV reader    1.00 reference
> >
> > CSVParser is just 78 lines and is not customisable. It really is
> > stripped to pretty much an absolute minimum. All of the parsers were
> > configured (if that made sense) to return an Array of Strings.
> >
> > Many of the CSV files I've worked with use short records instead of
> > ending a line with a lot of commas. Some of them also have the
> > occasional stray comment off to the right, not mentioned in the header.
> > I've also found it necessary to skip multiple lines at the beginning
> > and/or end. (Really, some government agencies seem to have NO idea that
> > anyone might want to do more with a CSV file than eyeball it in Excel.)
> >
> > If there is a benchmark suite I can use to improve CSVDecoder, I would
> > like to try it out.
> >
> > On Tue, 5 Jan 2021 at 02:36, jtuc...@objektfabrik.de <jtuc...@objektfabrik.de> wrote:
> >
> > Happy new year to all of you! May 2021 be an increasingly less crazy
> > year than 2020...
> >
> > I have a question that sounds a bit strange, but we are seeing two
> > effects with NeoCSVReader, both related to a wrong definition of the
> > reader.
> >
> > One effect is that reading a Stream with #upToEnd leads to an endless
> > loop; the other is that the Reader produces twice as many objects as
> > there are lines in the file being read.
> >
> > In both scenarios, the reason is that the CSV Reader has a wrong number
> > of column definitions.
> >
> > Of course that is my fault: why do I feed a "malformed" CSV file to
> > poor NeoCSVReader?
> >
> > Let me explain: we have a few import interfaces which end users can
> > define using a more or less nice assistant in our application. The CSV
> > files they upload to our app come from third parties like payment
> > providers, banks and other sources. These change their file structures
> > whenever they feel like it and never tell anybody. So a CSV import that
> > may have been working for years can one day tear a whole web server
> > image down because of a wrong number of fieldAccessors. This is bad on
> > many levels.
> >
> > You can easily try the doubling effect at home: define a working CSV
> > Reader and comment out one of the addField: commands before you use the
> > NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
> > columns each. If you remove one of the fieldAccessors, #upToEnd will
> > yield an Array of 6 objects rather than 3.
> >
> > I haven't found the reason for the cases where this leads to an endless
> > loop, but at least this one is clear...
> >
> > I *guess* this is due to the way #readEndOfLine is implemented. It
> > seems not to peek forward to the end of the line. I have the gut
> > feeling that #peekChar should peek instead of reading the #next
> > character from the input Stream, but #peekChar has too many senders to
> > just go ahead and mess with it ;-)
> >
> > So I wonder if there are any tried approaches to this problem.
> >
> > One thing I might do is not use #upToEnd, but read each line using
> > PositionableStream>>#nextLine and first check whether the number of
> > separators in the line matches the number of fieldAccessors minus 1
> > (going through the hoops of handling separators in quoted fields and
> > such...). Only if that test succeeds would I hand a Stream with the
> > whole line to the reader and do a #next.
> >
> > This will, however, mean a lot of extra cycles for large files. Of
> > course I could do this only for some lines, maybe just the first one.
> > Whatever.
> >
> > But somehow I have the feeling I should get an exception telling me
> > that the line is not compatible with the Reader's definition, or some
> > such. Or #readAtEndOrEndOfLine should just walk the line to the end,
> > ignore the rest of the line, and return an incomplete object...
> >
> > Maybe I am just missing the right setting or switch? What best
> > practices did you guys come up with for such problems?
> >
> > Thanks in advance,
> >
> > Joachim
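To make the doubling experiment above concrete, it amounts to roughly this (a sketch: three data lines of four columns each, but a reader defined with only three fields; per the quoted report, #upToEnd then answers six records instead of three):

    | input reader |
    input := ('a1,b1,c1,d1' , String cr , 'a2,b2,c2,d2' , String cr , 'a3,b3,c3,d3') readStream.
    reader := NeoCSVReader on: input.
    reader
        addField;
        addField;
        addField. "four columns in the data, but only three fieldAccessors"
    reader upToEnd size. "6 in the reported scenario, not the expected 3"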