The inner blob is expecting an io.Reader.  But perhaps I can change that 
to pass a Decoder, based on what you are saying.  For some reason I hadn't 
grokked that that is how Decoder works.  Just to reiterate what I think 
you are saying (and in case anyone stumbles across this thread later), 
assume a file that has this type of structure (call the outer blobs A, B, 
and C for reference):

{
  "elements" : [
    {...},
    {...}
  ]
}
{
  "elements" : [
    {...},
    {...}
  ]
}
[
  {...},
  {...}
]


The first call to Decode() will move the position to the first `{` in A.
   Something like exponent-io/jsonpath's SeekTo() could be used to advance 
to A's `[`.
   The second call to Decode(), with the embedded reader, will set the 
position at A's first inner {...}.
   Each subsequent call to Decode() will process each inner {...} of A one 
at a time until More() is false, at which point the position is at A's `]`.

The third call to Decode() will move the position to the first `{` in B.  
*Question: is this in fact correct?  If not, how do I get the reader to 
this point of the stream?*
   The fourth call to Decode() will let me stream-read to B's `[` (in this 
case using exponent-io/jsonpath's SeekTo() or some other mechanism).
   Each subsequent call to Decode() will process each inner {...} of B one 
at a time until More() is false, at which point the position is at B's `]`.

The fifth call to Decode() will move the position to the first `[` in C.
   Each subsequent call to Decode() will process each inner {...} of C one 
at a time until More() is false.  (I've tried to sketch this flow below.)
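
Something like this, using only encoding/json's Token()/More()/Decode() 
(untested, and assuming each outer object has a single "elements" key like 
my real input, so a fixed number of Token() calls reaches the `[`):

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	// Two outer blobs (A and B), concatenated just like my input.
	input := `{"elements":[{"n":1},{"n":2}]}
{"elements":[{"n":3},{"n":4}]}`

	dec := json.NewDecoder(strings.NewReader(input))
	for dec.More() { // one iteration per outer blob
		// Consume the outer '{', the "elements" key, and the '['.
		for i := 0; i < 3; i++ {
			if _, err := dec.Token(); err != nil {
				panic(err)
			}
		}
		// Each Decode() consumes exactly one inner {...}, so only
		// one inner object is in memory at a time.
		for dec.More() {
			var v map[string]interface{}
			if err := dec.Decode(&v); err != nil {
				panic(err)
			}
			fmt.Printf("%+v\n", v)
		}
		// Consume the closing ']' and '}'.
		for i := 0; i < 2; i++ {
			if _, err := dec.Token(); err != nil {
				panic(err)
			}
		}
	}
}

For a bare top-level array like C, you'd consume just the one `[` token 
before the inner loop (and the `]` after it).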


I realize this may not be what is actually going on internally inside 
these packages, but at a high level is that conceptually close to what is 
happening?

If this is true, I gotta say this is one of the things I *LOVE* about Go. 
I cannot count the number of times I've had some complicated problem that 
Go makes a whole lot easier.  Or, put another way: I was over-complicating 
the problem and not recognizing the underlying code defect that should 
change.  In fact, refactoring this code would be trivial even though it's 
used in about 100 places; I could probably just use perl -pi -e to fix it. 
And, if I may be a bit indulgent here, the quality of the answers that 
comes out of the Golang community is just amazing.  I love reading this 
mailing list even though I've only posted to it a few times.

- Greg


On Sunday, March 28, 2021 at 1:26:17 AM UTC-7 Brian Candler wrote:

> > This works, but the downside is that each {...} of bytes has to be 
> > pulled into memory.  And the function that is called is already 
> > designed to receive an io.Reader and parse the VERY large inner blob 
> > in an efficient manner.
>
> Is the inner blob decoder actually using a json.Decoder, as shown in your 
> example func secondDecoder()?  In that case, the simplest and most 
> efficient answer is to create a persistent json.Decoder which wraps the 
> underlying io.Reader directly, and just keep calling w2.Decode(&v) on each 
> call.  It will happily consume the stream, one object at a time.
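>
> Concretely, that might look like this (untested sketch): construct the 
> decoder once, outside the loop, and pass it around instead of the reader:
>
>     w2 := json.NewDecoder(r) // one decoder for the whole stream
>     for w2.More() {
>         secondDecoder(w2)
>     }
>
>     func secondDecoder(dec *json.Decoder) {
>         var v interface{}
>         if err := dec.Decode(&v); err != nil {
>             log.Fatal(err)
>         }
>         fmt.Printf("%+v\n", v)
>     }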
>
> If that's not possible for some reason, then it sounds like you want to 
> break the outer stream at outer object boundaries, i.e. { ... }, without 
> fully parsing it.  You can do that with json.RawMessage:
> https://play.golang.org/p/BitE6l27160
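>
> In outline, that approach looks something like this (a rough, untested 
> sketch; error handling mostly elided):
>
>     dec := json.NewDecoder(r) // r is the underlying io.Reader
>     for dec.More() {
>         var raw json.RawMessage
>         if err := dec.Decode(&raw); err != nil {
>             log.Fatal(err)
>         }
>         // raw now holds the bytes of one complete outer { ... };
>         // wrap it back up as an io.Reader for the inner parser
>         // (bytes.NewBuffer would work equally well here).
>         secondDecoder(bytes.NewReader(raw))
>     }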
>
> However, you've still read each object as a stream of bytes into memory, 
> and you've still done some of the work of parsing the JSON to find the 
> start and end of each object.  You can turn it back into an io.Reader by 
> creating a bytes.NewBuffer around it, if that's what the inner parser 
> requires.  But if each object is large, and you really need to avoid 
> reading it into memory at all, then you'd need some sort of rewindable 
> stream.
>
> Another approach is to stop the source generating pretty-printed JSON, and 
> make it generate in JSON-Lines <https://jsonlines.org/> format instead.  
> It sounds like you're unable to change the source, but you might be able to 
> un-prettyprint the JSON by using an external tool (perhaps jq can do 
> this).  Then I am thinking you could make a custom io.Reader which 
> returns data up to a newline, then reports EOF, and then hands you a 
> fresh io.Reader for the next line.
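>
> A deliberately naive sketch of that custom reader (untested; it reads 
> one byte at a time, so real code would want something less wasteful):
>
>     // lineReader yields data up to, but not including, the next '\n',
>     // then reports io.EOF.  reset() arms it for the following line.
>     type lineReader struct {
>         br   *bufio.Reader
>         done bool
>     }
>
>     func (lr *lineReader) Read(p []byte) (int, error) {
>         if lr.done {
>             return 0, io.EOF
>         }
>         if len(p) == 0 {
>             return 0, nil
>         }
>         b, err := lr.br.ReadByte()
>         if err != nil {
>             return 0, err // the underlying EOF ends the last line
>         }
>         if b == '\n' {
>             lr.done = true
>             return 0, io.EOF
>         }
>         p[0] = b
>         return 1, nil
>     }
>
>     func (lr *lineReader) reset() { lr.done = false }
>
> The inner decoder could then consume each line as an io.Reader without 
> the line ever being buffered whole, with a reset() call between lines.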
>
> But this is all very complicated, when keeping the inner Decoder around 
> from object to object is a simple solution to the problem that you 
> described.  Is there some other constraint which prevents you from doing 
> this?
>
> On Saturday, 27 March 2021 at 19:42:40 UTC greg.sa...@gmail.com wrote:
>
>> Good afternoon,
>>
>> I have a file containing a sequence of hashes (it could be arrays too, 
>> as the underlying object type seems irrelevant), as per RFC 7464.  I 
>> cannot figure out how to handle this in a memory-efficient way that 
>> doesn't involve pulling each blob entirely into memory.
>>
>> I've tried to express this on Go playground here: 
>> https://play.golang.org/p/Aqx0gnc39rn
>> Note that I'm using exponent-io/jsonpath as the JSON decoder, but 
>> certainly that could be swapped for something else.
>>
>> In essence here is an example of the input bytes:
>>
>> {
>>    "elements" : [
>>       {
>>          "Space" : "YCbCr",
>>          "Point" : {
>>             "Cb" : 0,
>>             "Y" : 255,
>>             "Cr" : -10
>>          }
>>       },
>>       {
>>          "Point" : {
>>             "B" : 255,
>>             "R" : 98,
>>             "G" : 218
>>          },
>>          "Space" : "RGB"
>>       }
>>    ]
>> }
>> {
>>    "elements" : [
>>       {
>>          "Space" : "YCbCr",
>>          "Point" : {
>>             "Cb" : 3000,
>>             "Y" : 355,
>>             "Cr" : -310
>>          }
>>       },
>>       {
>>          "Space" : "RGB",
>>          "Point" : {
>>             "B" : 355,
>>             "G" : 318,
>>             "R" : 108
>>          }
>>       }
>>    ]
>> }
>> {
>>    "elements" : [
>>       {
>>          "Space" : "YCbCr",
>>          "Point" : {
>>             "Cr" : -410,
>>             "Cb" : 400,
>>             "Y" : 455
>>          }
>>       },
>>       {
>>          "Space" : "RGB",
>>          "Point" : {
>>             "B" : 455,
>>             "R" : 118,
>>             "G" : 418
>>          }
>>       }
>>    ]
>> }
>>
>> I can iterate through that with this code:
>>
>> w := json.NewDecoder(bytes.NewReader(j))
>> for w.More() {
>>     var v interface{}
>>     w.Decode(&v)
>>     fmt.Printf("%+v\n", v)
>> }
>>
>> This works, but the downside is that each {...} of bytes has to be pulled 
>> into memory.  And the function that is called is already designed to 
>> receive an io.Reader and parse the VERY large inner blob in an efficient 
>> manner.
>>
>> So in principle, this is kinda what I want to do, but maybe I'm looking 
>> at it all wrong:
>>
>> w := json.NewDecoder(bytes.NewReader(j))
>> for w.More() {
>>     reader2 := ???? // some io.Reader representing each of the 3 json-seq blocks
>>     secondDecoder(reader2)
>> }
>>
>> func secondDecoder(reader io.Reader) {
>>     w2 := json.NewDecoder(reader)
>>     var v interface{}
>>     w2.Decode(&v)
>>     fmt.Printf("%+v\n", v)
>> }
>>
>> Any ideas on how to solve this problem?
>>
>> I should note that it is not possible for the input to change in this 
>> case as the system that consumes it is not the same one that has been 
>> generating it for the past 5 years.
>>
>> Thanks!
>>
>> - Greg
>>
>>
