Expanding specifically on the JSON streaming idea. (Sorry if I'm rehashing ideas already stated; I just subscribed yesterday and have only glanced at the recent archives.)
2014-03-21 6:50 GMT-04:00 Zev Weiss <z...@bewilderbeest.net>:
> (Though w.r.t another aspect of Marc-Antoine's comment -- JSON doesn't
> necessarily have to be un-streamable, does it? Couldn't you just leave the
> top-level structure of the output file as the concatenation of a bunch of
> discrete JSON objects, without wrapping them up in an array or similar?)

Having a stream of individually JSON-encoded items would probably be fine, but then decoding must be simple. For example, each line could be one JSON-encoded value, separated by a simple \n, e.g.

-- CUT HERE --
{"version":"0.1","pid":82323,"ppid":3342,"cwd":"/home/blank","uid":123,"functions":[...],...}
[1395408175.21312,0,"open",["/path/to/file",0700]]
[1395408175.56843,1,3]
-- CUT HERE --

The file itself is *not* a valid JSON file; it's a \n-joined list of JSON-encoded packets.

The hypothetical format is:
- First item: a dict describing the format and the global state at the start of the log. It's fine for this line to be verbose, since it occurs only once in the log. It tells the reader how to read the rest of the file. File format versioning FTW.
- Rest: each line is a single list with one common part and one variable part.
    Common:
      [timestamp, returnid,
      if returnid == 0, it's a call, else it's a return.
    For call,
      ..., function_name, [args] ]
    For return,
      ..., returnvalue ]

The returnid permits a strict match of call->return lines. Its value is the index of the log entry where the call was logged; that index is itself omitted from the common part of each line for brevity. I had initially put it in the log line, but I think keeping the line as dense as possible has value. For compactness, the function name could be a numeric function id instead. So the actual log lines are relatively dense even if text/ascii encoded.

I'm not describing other things like signal and process events, since it's really just an example design, but the general idea of common part + variable part would remain.

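To give a feel for the reader side, here is a minimal sketch in Python of a decoder for the example above. Everything in it beyond the format description is an assumption on my part: the name read_trace is made up, the header is counted as entry 0 so that the first event is entry 1 (which is what makes returnid 1 in the sample point at the open call), the elided header fields are simply left out, and the mode is written as 448 because a leading-zero number like 0700 is not valid JSON.

```python
import json

def read_trace(lines):
    """Decode a newline-joined list of JSON packets.

    Returns (header, calls), where calls maps the entry index of each
    call to a record that is completed once its return line is seen.
    """
    entries = [line for line in lines if line.strip()]
    header = json.loads(entries[0])          # entry 0: format and global state
    calls = {}
    for index, line in enumerate(entries[1:], start=1):
        timestamp, returnid, *rest = json.loads(line)
        if returnid == 0:
            # Call: [timestamp, 0, function_name, [args]]
            name, args = rest
            calls[index] = {"name": name, "args": args,
                            "start": timestamp, "end": None, "result": None}
        else:
            # Return: [timestamp, returnid, returnvalue]
            call = calls[returnid]
            call["end"] = timestamp
            call["result"] = rest[0]
    return header, calls

# Hypothetical sample: same shape as the example, elided fields dropped.
SAMPLE = """\
{"version":"0.1","pid":82323,"ppid":3342,"cwd":"/home/blank","uid":123}
[1395408175.21312,0,"open",["/path/to/file",448]]
[1395408175.56843,1,3]
"""

header, calls = read_trace(SAMPLE.splitlines())
for index, call in calls.items():
    print(index, call["name"], call["args"], "->", call["result"])
```

A real reader would check header["version"] before trusting the layout of the remaining lines, and would have to cope with calls that never return.
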
I think it would be relatively easy to do, reader-implementation-wise. That said, two problems remain with the encoding itself:
- JSON numbers are written as base-10 text, and many decoders read them as doubles. So using something like hex encoded in a string would be more efficient.
- JSON strings are assumed to be Unicode. This could mean using custom escaping for byte streams. base64 is a valid option in that case, but it inflates the data by about a third (roughly 37% once MIME-style line breaks are included).

There are two completely separate questions:
- each packet itself could be encoded with "something", where I picked JSON as the something.
- defining each packet properly; I sketched an example with 3 packet descriptions.

For the encoding itself, the big question is: do you want the output to be ascii or binary? That influences what you are going to select.

One potentially interesting side effect of JSON for a subset of users is that it's trivial to read from python, because a json module is included in its stdlib. Using something like BSON or MessagePack means the user will have to install third-party libraries first. No big deal, but still one more step to do. It could be annoying when a sysadmin wants to log in to a server and quickly diagnose something. To state the obvious, using an encoding that has widespread support in many languages (I'd say at least perl, python, C++) would be better.

Just sayin', I'm not vested in this choice.

M-A

_______________________________________________
Strace-devel mailing list
Strace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/strace-devel