Thank you very much. I converted your benchmark to my Smalltalk dialect and was pleased with the results. It gave me the impetus I needed to implement the equivalent of NeoCSVReader's #recordClass: feature, although in my case it requires the class to implement #withAll:, and the operand is a (reused) OrderedCollection.
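For anyone wanting to try the same hook: the only requirement on the record class is a class-side #withAll: that copies the field values out before the (reused) collection is recycled. A minimal sketch, with an illustrative record class (the names are not part of any actual API):

    Object subclass: #PaymentRecord
        instanceVariableNames: 'date amount reference'
        classVariableNames: ''
        package: 'CSV-Demo'

    PaymentRecord class >> withAll: fields
        "fields is the reader's reused OrderedCollection,
        so copy the values out rather than keeping the argument"
        ^ self new
            setDate: (fields at: 1)
            amount: (fields at: 2)
            reference: (fields at: 3)

    PaymentRecord >> setDate: aDate amount: anAmount reference: aReference
        date := aDate.
        amount := anAmount.
        reference := aReference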
There's one difference between CSVEncoder and NeoCSVWriter that might be of interest: you can't tell CSVEncoder whether a field is #raw or #quoted, because it always figures that out for itself. I was prepared to pay an efficiency penalty to make sure I did not get this wrong, and am pleased to find it wasn't as much of a penalty as I feared.

On Wed, 6 Jan 2021 at 22:52, Sven Van Caekenberghe <s...@stfx.eu> wrote:
> Hi Richard,
>
> Benchmarking is a can of worms; many factors have to be considered. But the
> first requirement is obviously to be completely open about what you are
> doing and what you are comparing.
>
> NeoCSV contains a simple benchmark suite called NeoCSVBenchmark, which was
> used during development. Note that it is a bit tricky to use: you need to
> run a write benchmark with a specific configuration before you can try the
> read benchmarks.
>
> The core data is a 100,000-line file (2.5 MB) like this:
>
> 1,-1,99999
> 2,-2,99998
> 3,-3,99997
> 4,-4,99996
> 5,-5,99995
> 6,-6,99994
> 7,-7,99993
> 8,-8,99992
> 9,-9,99991
> 10,-10,99990
> ...
>
> That parses in ~250ms on my machine.
>
> NeoCSV has quite a few features and handles various edge cases. Obviously,
> a minimal, custom implementation could be faster.
>
> NeoCSV is called efficient not just because it is reasonably fast, but
> because it can be configured to generate domain objects without
> intermediate structures, and because it can convert individual fields
> (parse numbers, dates, times, ...) while parsing.
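> Configured for the integer triples above, a minimal sketch of that looks
> like this (by default each record comes back as an Array; each
> addIntegerField converts its column while parsing):
>
>     | input reader |
>     input := ('1,-1,99999' , String cr , '2,-2,99998' , String cr , '3,-3,99997') readStream.
>     reader := NeoCSVReader on: input.
>     reader
>         addIntegerField;
>         addIntegerField;
>         addIntegerField.
>     reader upToEnd.
>     "a collection of three records, e.g. #(1 -1 99999), with Integer fields"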
> Like you said, some generated CSV output out in the wild is very
> irregular. I try to stick with standard CSV as much as possible.
>
> Sven
>
> > On 6 Jan 2021, at 05:10, Richard O'Keefe <rao...@gmail.com> wrote:
> >
> > NeoCSVReader is described as efficient. What is that in comparison to?
> > What benchmark data are used? Here are benchmark results measured today.
> >
> > (5,000 data line file, 9,145,009 characters)
> >
> > method                  time (ms)
> > Just read characters          410
> > CSVDecoder>>next             3415   astc's CSV reader (defaults)    1.26 x CSVParser
> > NeoCSVReader>>next           4798   NeoCSVReader (default state)    1.78 x CSVParser
> > CSVParser>>next              2701   pared-to-the-bone CSV reader    1.00 reference
> >
> > (10,000 data line file, 1,544,836 characters)
> >
> > method                  time (ms)
> > Just read characters           93
> > CSVDecoder>>next              530   astc's CSV reader (defaults)    1.26 x CSVParser
> > NeoCSVReader>>next            737   NeoCSVReader (default state)    1.75 x CSVParser
> > CSVParser>>next               421   pared-to-the-bone CSV reader    1.00 reference
> >
> > CSVParser is just 78 lines and is not customisable. It really is
> > stripped to pretty much an absolute minimum. All of the parsers were
> > configured (if that made sense) to return an Array of Strings.
> >
> > Many of the CSV files I've worked with use short records instead of
> > ending a line with a lot of commas. Some of them also have the
> > occasional stray comment off to the right, not mentioned in the header.
> > I've also found it necessary to skip multiple lines at the beginning
> > and/or end. (Really, some government agencies seem to have NO idea that
> > anyone might want to do more with a CSV file than eyeball it in Excel.)
> >
> > If there is a benchmark suite I can use to improve CSVDecoder, I would
> > like to try it out.
> >
> > On Tue, 5 Jan 2021 at 02:36, jtuc...@objektfabrik.de <jtuc...@objektfabrik.de> wrote:
> >
> > Happy new year to all of you! May 2021 be an increasingly less crazy
> > year than 2020...
> >
> > I have a question that sounds a bit strange, but we are seeing two
> > effects with NeoCSVReader, both related to a wrong definition of the
> > reader.
> >
> > One effect is that reading a Stream with #upToEnd leads to an endless
> > loop; the other is that the Reader produces twice as many objects as
> > there are lines in the file being read.
> >
> > In both scenarios, the reason is that the CSV Reader has a wrong number
> > of column definitions.
> >
> > Of course that is my fault: why do I feed a "malformed" CSV file to
> > poor NeoCSVReader?
> >
> > Let me explain: we have a few import interfaces which end users can
> > define using a more or less nice assistant in our application. The CSV
> > files they upload to our app come from third parties like payment
> > providers, banks and other sources. These change their file structures
> > whenever they feel like it and never tell anybody. So a CSV import that
> > may have been working for years can one day tear a whole web server
> > image down because of a wrong number of fieldAccessors. This is bad on
> > many levels.
> >
> > You can easily try the doubling effect at home: define a working CSV
> > Reader and comment out one of the addField: commands before you use the
> > NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
> > columns each. If you remove one of the fieldAccessors, #upToEnd will
> > yield an Array of 6 objects rather than 3.
> >
> > I haven't found the reason for the cases where this leads to an endless
> > loop, but at least this one is clear...
> >
> > I *guess* this is due to the way #readEndOfLine is implemented. It
> > seems not to peek forward to the end of the line. I have the gut
> > feeling that #peekChar should peek instead of reading the #next
> > character from the input Stream, but #peekChar has too many senders to
> > just go ahead and mess with it ;-)
> >
> > So I wonder if there are any tried approaches to this problem.
> >
> > One thing I might do is not use #upToEnd, but read each line using
> > PositionableStream>>#nextLine and first check whether the number of
> > separators in the line matches the number of fieldAccessors minus 1
> > (going through the hoops of handling separators in quoted fields and
> > such...). Only if that test succeeds would I hand a Stream with the
> > whole line to the reader and do a #next.
> >
> > This will, however, mean a lot of extra cycles for large files. Of
> > course I could do this only for some lines, maybe just the first one.
> > Whatever.
> >
> > But somehow I have the feeling I should get an exception telling me
> > that the line is not compatible with the Reader's definition, or some
> > such. Or #readAtEndOrEndOfLine should just walk the line to the end,
> > ignore the rest of the line, and return an incomplete object...
> >
> > Maybe I am just missing the right setting or switch? What best
> > practices did you guys come up with for such problems?
> >
> > Thanks in advance,
> >
> > Joachim
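To make the doubling experiment above concrete, it amounts to roughly this (a sketch: three data lines of four columns each, but a reader defined with only three fields; per the quoted report, #upToEnd then answers six records instead of three):

    | input reader |
    input := ('a1,b1,c1,d1' , String cr , 'a2,b2,c2,d2' , String cr , 'a3,b3,c3,d3') readStream.
    reader := NeoCSVReader on: input.
    reader
        addField;
        addField;
        addField. "four columns in the data, but only three fieldAccessors"
    reader upToEnd size. "6 in the reported scenario, not the expected 3"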