On 2024-09-30 at 21:34:07 +0200, Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API," Left Right via Python-list <python-list@python.org> wrote:
> > What am I missing? Handwavingly, start with the first digit, and as
> > long as the next character is a digit, multiply the accumulated result
> > by 10 (or the appropriate base) and add the next value. Oh, and handle
> > scientific notation as a special case, and perhaps fail spectacularly
> > instead of recovering gracefully in certain edge cases. And in the
> > pathological case of a single number with 60 billion digits, run out of
> > memory (and complain loudly to the person who claimed that the file
> > contained a "dataset"). But why do I need to start with the least
> > significant digit?

> You probably forgot that it has to be _streaming_. Suppose you parse
> the first digit: can you hand this information over to an external
> function to process the parsed data? -- No! Because you don't know the
> magnitude yet. What about two digits? -- Same thing. You cannot
> leave the parser code until you know the magnitude (otherwise the
> information is useless to the external code).

If I recognize the first digit, then I *can* hand it over to an
external function, which accumulates the digits that follow.

> So, even if you have enough memory and don't care about special cases
> like scientific notation: yes, you will be able to parse it, but it
> won't be a streaming parser.

Under that constraint, I'm not sure I can parse anything. How can I
parse a string (and hand it over to an external function) before I've
found the closing quote? How much state can a parser maintain (before
it invokes an external function) and still be considered streaming? I
fear we may be getting hung up on terminology rather than solving the
problem at hand.

-- 
https://mail.python.org/mailman/listinfo/python-list
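To make the claim concrete, here is a minimal sketch (names `feed_digits` and `Accumulator` are my own, not from the thread) of what I mean: the parser hands each digit to an external consumer the moment it sees it, and the consumer does the multiply-by-10-and-add accumulation, most significant digit first. The parser itself keeps no growing state, which is what I'd call streaming.

```python
def feed_digits(chars, on_digit):
    """Scan characters, handing each digit to on_digit as soon as
    it is recognized; stop at the first non-digit character."""
    for ch in chars:
        if ch.isdigit():
            on_digit(int(ch))
        else:
            break


class Accumulator:
    """External consumer: multiply the running value by 10 and add
    each incoming digit (most-significant-first accumulation)."""

    def __init__(self):
        self.value = 0

    def __call__(self, digit):
        self.value = self.value * 10 + digit


acc = Accumulator()
feed_digits(iter("12345,"), acc)
print(acc.value)  # 12345
```

Nothing here needs the magnitude up front; the consumer simply refines its value as digits arrive, the same way one can emit string fragments before seeing the closing quote.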