Richard,
On 07.01.21 at 07:15, Richard O'Keefe wrote:
You aren't sure what point I was making?
Exactly: the thread you answered was about a possible bug in the NeoCSV
parser. Your post was about your doubts regarding the claim of efficiency on
the parser's web site. So you threw in a completely unrelated topic
and opened by sounding more or less destructive (maybe that word is
too harsh, but I am not a native English speaker... maybe "challenging"
is a better word?).
I cannot comment on the efficiency of NeoCSV, other than that it is fast
enough for my use case and that it gives me the option of combining reading
CSV and producing objects in one run (see the sketch below), even if some
checks, backpointering, whatever have to be done after the parsing. It has
a nice API and is supported quite well. The thread and Sven's reaction
underline this last statement quite impressively: my bug was fixed
within hours.
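Just to illustrate what I mean by that combination, a minimal sketch (the
record class and field names are made up for the example):

	| csv payments |
	csv := 'reference,amount', String cr, 'A-1,19.99', String cr, 'A-2,5.00'.
	payments := (NeoCSVReader on: csv readStream)
		skipHeader;
		recordClass: Payment;     "made-up domain class with setters #reference: and #amount:"
		addField: #reference:;    "kept as a String"
		addFloatField: #amount:;  "converted to a Float while parsing"
		upToEnd.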
How about the one I actually wrote down:
What test data was NeoCSV benchmarked with
and can I get my hands on it?
That is a valid question. It is off-topic in that thread, however, and
maybe your tone was a bit less kind than it should have been. Nevertheless,
the discussion itself is worth its own thread. If the raw speed of reading
lots of CSV data is of concern in a use case, we should look for and at
alternatives. You are of course free to ask about alternatives, present
your measurements or an alternative implementation, and ask for comments,
ideas, all kinds of input. That's what yields progress.
THAT is the point. The data points I showed (and
many others I have not) are not satisfactory to me.
Fine, and absolutely worth discussing. Maybe in its own discussion thread,
started with a friendly invitation for discussion. Your post read
more like "Oh, and, by the way, NeoCSV sucks". Maybe unintended, but
that is what I read.
I have been searching for CSV test collections.
One site offered 6 files of which only one downloaded.
I found a "benchmark suite" for CSV containing no
actual CSV files.
So where *else* should I look for benchmark data than
associated with a parser people in this community are
generally happy with that is described as "efficient"?
So you would like the developers of NeoCSV to provide test data that
allows for benchmarking and comparison? A valid point.
Is it so unreasonable to suspect that my results might
be a fluke? Is it bad manners to assume that something
described as efficient has tests showing that?
Well, no. It is absolutely okay to ask whether a claim like "efficient" can
be backed up. You are free to present better choices and discuss your
definition of efficiency.
For me personally, your post sounded a bit like some earlier ones of
yours which seemed to have no other point than "I have something better,
but I won't show you". Hence my reaction. I may have read something into
your post that you didn't write into it. Sorry for that.
Joachim
On Wed, 6 Jan 2021 at 22:23, jtuc...@objektfabrik.de wrote:
Richard,
I am not sure what point you are trying to make here.
You have something cooler and faster? Great, how about sharing?
You could make a faster one if it doesn't convert numbers and
such? Great. I guess in 95% of the use cases the time will be spent
after parsing anyway. It depends. And that is exactly what you are
saying: the word "efficient" means nothing without context. How is
that related to this thread?
I think this thread mostly shows the strength of a community,
especially when there are members who are active, friendly and
highly motivated. My problem got solved at blazing speed, without
me paying anything for it, just because Sven thought my problem
could be other people's problem as well.
I am happy with NeoCSV's speed, even if there may be more
lightweight and faster solutions. To be honest, my main concern with
NeoCSV is not speed, but how well I can understand problems and fix
them. I care about data types at parsing time: a non-configurable CSV
parser gives me a bunch of Dictionaries and Strings, which can be a
waste of cycles and memory once you need the data as objects.
My use case is not importing trillions of records all day, and for
a few hundred or sometimes maybe a few thousand, it is good/fast enough.
Joachim
On 06.01.21 at 05:10, Richard O'Keefe wrote:
NeoCSVReader is described as efficient. What is that
in comparison to? What benchmark data are used?
Here are benchmark results measured today.
(5,000 data line file, 9,145,009 characters).
method                  time (ms)   notes
Just read characters          410
CSVDecoder>>next             3415   astc's CSV reader (defaults), 1.26 x CSVParser
NeoCSVReader>>next           4798   NeoCSVReader (default state), 1.78 x CSVParser
CSVParser>>next              2701   pared-to-the-bone CSV reader, 1.00 (reference)
(10,000 data line file, 1,544,836 characters).
method                  time (ms)   notes
Just read characters           93
CSVDecoder>>next              530   astc's CSV reader (defaults), 1.26 x CSVParser
NeoCSVReader>>next            737   NeoCSVReader (default state), 1.75 x CSVParser
CSVParser>>next               421   pared-to-the-bone CSV reader, 1.00 (reference)
CSVParser is just 78 lines and is not customisable. It really is
stripped to pretty much an absolute minimum. All of the parsers
were configured (if that made sense) to return an Array of Strings.
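If anyone wants to take comparable measurements on their own data, a
minimal timing sketch in Pharo might look like this (the file name is made
up, and only the NeoCSVReader case is shown, since the other two readers
are not publicly available):

	"Time one full parse into Arrays of Strings, NeoCSVReader's default output."
	| ms |
	ms := Time millisecondsToRun: [
		'data.csv' asFileReference readStreamDo: [ :in |
			(NeoCSVReader on: in) upToEnd ] ].
	Transcript show: 'NeoCSVReader: ', ms printString, ' ms'; cr.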
Many of the CSV files I've worked with use short records instead
of ending a line with a lot of commas. Some of them also have
the occasional stray comment off to the right, not mentioned in
the header.
I've also found it necessary to skip multiple lines at the beginning
and/or end. (Really, some government agencies seem to have NO idea
that anyone might want to do more with a CSV file than eyeball it in
Excel.)
If there is a benchmark suite I can use to improve CSVDecoder,
I would like to try it out.
On Tue, 5 Jan 2021 at 02:36, jtuc...@objektfabrik.de wrote:
Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...
I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader. One effect
is that reading a Stream #upToEnd leads to an endless loop; the other is
that the Reader produces twice as many objects as there are lines in the
file that is being read. In both scenarios, the reason is that the CSV
Reader has a wrong number of column definitions.
Of course that is my fault: why do I feed a "malformed" CSV file to poor
NeoCSVReader?
Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import that
may have been working for years may one day tear a whole web server
image down because of a wrong number of fieldAccessors. This is bad on
many levels.
You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use the
NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
columns each. If you remove one of the fieldAccessors, an #upToEnd will
yield an Array of 6 objects rather than 3 (see the sketch below).
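If I recall the API correctly, a minimal sketch of that experiment (data
made up, using the positional addField variant so no record class is
needed):

	| csv reader |
	csv := '1,2,3,4', String cr, '5,6,7,8', String cr, '9,10,11,12'.
	reader := NeoCSVReader on: csv readStream.
	reader addField; addField; addField.  "one field definition too few for 4 columns"
	reader upToEnd size.                  "yields 6 rather than 3, as described above"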
I haven't found the reason for the cases where this leads to an endless
loop, but at least this one is clear...
I *guess* this is due to the way #readEndOfLine is implemented. It seems
not to peek forward to the end of the line. I have the gut feeling
#peekChar should peek instead of reading the #next character from the
input Stream, but #peekChar has too many senders to just go ahead and
mess with it ;-)
So I wonder if there are any tried approaches to this problem.
One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check whether the line's number
of separators matches the number of fieldAccessors minus 1 (and go
through the hoops of handling separators in quoted fields and such...).
Only if that test succeeds would I hand a Stream with the whole line to
the reader and do a #next. A rough sketch of that idea follows below.
This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.
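Roughly along these lines (names made up, and deliberately naive:
counting raw commas would miscount separators inside quoted fields):

	| expected stream records |
	expected := 4.  "number of fieldAccessors in the real reader"
	records := OrderedCollection new.
	stream := ('a,b,c,d', String cr, '1,2,3') readStream.
	[ stream atEnd ] whileFalse: [
		| line |
		line := stream nextLine.
		(line occurrencesOf: $,) = (expected - 1)
			ifTrue: [ records add: (NeoCSVReader on: line readStream) next ]
			ifFalse: [ self error: 'Line does not match the reader definition: ', line ] ].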
But somehow I have the feeling I should get an exception telling me the
line is not compatible with the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of it, returning an incomplete object...
Maybe I am just missing the right setting or switch? What best practices
did you guys come up with for such problems?
Thanks in advance,
Joachim
--
-----------------------------------------------------------------------
Objektfabrik Joachim Tuchel mailto:jtuc...@objektfabrik.de
Fliederweg 1 http://www.objektfabrik.de
D-71640 Ludwigsburg http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0 Fax: +49 7141 56 10 86 1