Richard,

I am not sure what point you are trying to make here.
You have something cooler and faster? Great, how about sharing it?
You could make a faster one if it didn't convert numbers and such?
Great. I guess that in 95% of the use cases the time will be spent
after parsing anyway. It depends. And that is exactly what you are
saying: the word "efficient" means nothing without context. How is
that related to this thread?
I think this thread mostly shows the strength of a community,
especially when there are members who are active, friendly and highly
motivated. My problem got solved at blazing speed, without me paying
anything for it, just because Sven thought my problem could be other
people's problem as well.
I am happy with NeoCSV's speed, even if there may be more lightweight
and faster solutions. To be honest, my main concern with NeoCSV is
not speed, but how well I can understand problems and fix them. I
care about data types on parsing. A non-configurable CSV parser gives
me a bunch of dictionaries and Strings. That can be a waste of cycles
and memory once you need the data as objects.
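To illustrate what I mean by configurable, here is a minimal sketch
of a typed NeoCSVReader setup (the Payment class, its accessors and
the sample data are made up for the example):

    "Payment is a hypothetical domain class with setters
     #date: #amount: #reference:"
    | input reader |
    input := 'date,amount,reference', String cr,
             '2021-01-05,42.50,INV-001'.
    reader := NeoCSVReader on: input readStream.
    reader
        skipHeader;
        recordClass: Payment;
        "convert each column while parsing, instead of
         post-processing Strings"
        addField: #date: converter: [ :string | Date fromString: string ];
        addFloatField: #amount: ;
        addField: #reference: .
    reader upToEnd    "a collection of Payment instances, not Strings"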
My use case is not importing trillions of records all day, and for a
few hundred, or sometimes a few thousand, it is good and fast enough.
Joachim
On 06.01.21 at 05:10, Richard O'Keefe wrote:
NeoCSVReader is described as efficient. What is that
in comparison to? What benchmark data are used?
Here are benchmark results measured today.
(5,000 data line file, 9,145,009 characters)

method                 time (ms)  notes
Just read characters         410
CSVDecoder>>next            3415  astc's CSV reader (defaults), 1.26 x CSVParser
NeoCSVReader>>next          4798  NeoCSVReader (default state), 1.78 x CSVParser
CSVParser>>next             2701  pared-to-the-bone CSV reader, 1.00 (reference)
(10,000 data line file, 1,544,836 characters)

method                 time (ms)  notes
Just read characters          93
CSVDecoder>>next             530  astc's CSV reader (defaults), 1.26 x CSVParser
NeoCSVReader>>next           737  NeoCSVReader (default state), 1.75 x CSVParser
CSVParser>>next              421  pared-to-the-bone CSV reader, 1.00 (reference)
CSVParser is just 78 lines and is not customisable. It really is
stripped to pretty much an absolute minimum. All of the parsers
were configured (if that made sense) to return an Array of Strings.
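(For anyone who wants to take comparable measurements: a minimal
sketch of how such a timing can be run in a Smalltalk like Pharo.
The file name is a placeholder, and the reader here uses its
default configuration.)

    | ms |
    ms := Time millisecondsToRun: [
        'data5000.csv' asFileReference readStreamDo: [ :stream |
            (NeoCSVReader on: stream) upToEnd ] ].
    Transcript show: 'NeoCSVReader: ', ms printString, ' ms'; cr.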
Many of the CSV files I've worked with use short records instead
of ending a line with a lot of commas. Some of them also have the
occasional stray comment off to the right, not mentioned in the header.
I've also found it necessary to skip multiple lines at the beginning
and/or end. (Really, some government agencies seem to have NO idea
that anyone might want to do more with a CSV file than eyeball it in
Excel.)
If there is a benchmark suite I can use to improve CSVDecoder,
I would like to try it out.
On Tue, 5 Jan 2021 at 02:36, jtuc...@objektfabrik.de wrote:
Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...
I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader. One
effect is that reading a Stream with #upToEnd leads to an endless
loop; the other is that the Reader produces twice as many objects as
there are lines in the file that is being read. In both scenarios,
the reason is that the CSV Reader has the wrong number of column
definitions.
Of course that is my fault: why do I feed a "malformed" CSV file to
poor NeoCSVReader?
Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The
CSV files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import
that may have been working for years may one day tear a whole web
server image down because of a wrong number of fieldAccessors. This
is bad on many levels.
You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use
the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines
with 4 columns each. If you remove one of the fieldAccessors, an
#upToEnd will yield an Array of 6 objects rather than 3.
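For the record, a sketch of how I reproduce it, with the default
Array records (the sample data is made up; the field counts match
the description above):

    | input reader |
    input := 'a1,b1,c1,d1', String cr,
             'a2,b2,c2,d2', String cr,
             'a3,b3,c3,d3'.
    reader := NeoCSVReader on: input readStream.
    reader
        addField;
        addField;
        addField.    "only 3 fieldAccessors for 4-column data"
    reader upToEnd size    "6, not the expected 3"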
I haven't found the reason for the cases where this leads to an
endless loop, but at least this one is clear...
I *guess* this is due to the way #readEndOfLine is implemented. It
seems not to peek forward to the end of the line. I have the gut
feeling #peekChar should peek instead of reading the #next character
from the input Stream, but #peekChar has too many senders to just go
ahead and mess with it ;-)
So I wonder if there are any tried approaches to this problem.
One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check whether the number of
separators in the line matches the number of fieldAccessors minus 1
(going through the hoops of handling separators in quoted fields and
such...). Only if that test succeeds would I hand a Stream with the
whole line to the reader and do a #next (see the sketch below).
This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.
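In code, the idea would be something like this rough sketch (the
naive comma count ignores separators inside quoted fields, and
expectedFieldCount stands for whatever my reader definition implies;
the sample data is made up):

    | input expectedFieldCount records |
    input := ('a1,b1,c1,d1', String cr, 'a2,b2,c2') readStream.
    expectedFieldCount := 4.    "second sample line is short"
    records := OrderedCollection new.
    [ input atEnd ] whileFalse: [ | line |
        line := input nextLine.
        (line occurrencesOf: $,) = (expectedFieldCount - 1)
            ifTrue: [ "really my configured reader, not a default one"
                records add: (NeoCSVReader on: line readStream) next ]
            ifFalse: [ Error signal:
                'line does not match the Reader definition: ', line ] ].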
But somehow I have the feeling I should get an exception telling me
the line is not compatible with the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to its end and ignore
the rest of the line, returning an incomplete object...
Maybe I am just missing the right setting or switch? What best
practices did you guys come up with for such problems?
Thanks in advance,
Joachim
--
-----------------------------------------------------------------------
Objektfabrik Joachim Tuchel mailto:jtuc...@objektfabrik.de
Fliederweg 1 http://www.objektfabrik.de
D-71640 Ludwigsburg http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0 Fax: +49 7141 56 10 86 1