Re: [GSOC 2014] structured output of strace

Zev Weiss Fri, 21 Mar 2014 04:10:32 -0700

(CCing strace-devel as well, hope that's OK...)

On Mar 20, 2014, at 12:54 PM, yangmin zhu <zym00...@gmail.com> wrote:


> Hi,
>  I'm yangmin zhu. I'm a master student from University of Chinese Academy of 
> Sciences and now I'm participating in the Google Summer of Code 2014.
>  I'm working for the strace project about structured output. You can find 
> more information from [1] and [2]. And I find your work from [3] and [4].
>  I think it would be great to contact the strace output parser's author to 
> collect their actual needs. I'm trying to modify strace to support output in 
> JSON format. But I'm not very clear what the exact format you want.
>   For example,
> 1) should all the value in the JSON output be string? or some value should be 
> number?
> 2) which of the following style of syscall's arguments do you prefer?
>   "args" : ["arg1", "arg2", "arg3" ] 
> or
>   "arg1" : [ "arg1_name" : "arg1_value", "arg2_name" : "arg2_value" ]
> 
> ANY suggestions are welcome.
> 
> Thank you.
> 
> yangmin zhu

Hi Yangmin,

Firstly, thanks for getting in touch!

On the specifics you mentioned:

1) I think using "real" types (e.g. actual integers instead of string-encoded 
ones) wherever possible would be highly preferable in order to simplify parsing 
by downstream structured-output consumers.

2) I guess I don't have any real strong opinions at this point on whether 
syscall arguments should be named in a map/dictionary style collection or a 
simple ordered list/array.  I could see the map keys being potentially useful 
in certain situations, but looked at over an entire trace it seems like it 
would result in a great deal of redundancy (e.g. duplicating "domain", "type", 
and "protocol" for every instance of a socket(2) call); also I'd guess that 
many if not most potential consumers of structured output would need (or 
already have) some awareness of syscall parameter lists built into them anyway, 
so I guess I'd probably lean toward a plain unlabeled array.
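
To make the trade-off in points 1 and 2 concrete, here is a minimal Python sketch comparing the two layouts for a single socket(2) call. The record shape and field names ("syscall", "args", "ret") are made up for illustration and are not an existing or proposed strace format.

```python
import json

# Hypothetical record layouts for one socket(2) call. The field names are
# illustrative assumptions, not an existing strace output format.
as_list = {"syscall": "socket", "args": ["AF_INET", "SOCK_STREAM", 0], "ret": 3}
as_map = {
    "syscall": "socket",
    "args": {"domain": "AF_INET", "type": "SOCK_STREAM", "protocol": 0},
    "ret": 3,
}

list_size = len(json.dumps(as_list))
map_size = len(json.dumps(as_map))

# The labeled form repeats "domain", "type" and "protocol" in every record,
# so its per-call overhead is paid again for each socket(2) in the trace.
overhead = map_size - list_size
print(list_size, map_size, overhead)

# With "real" types, the protocol and return value come back as integers
# and the consumer does not have to convert strings itself.
decoded = json.loads(json.dumps(as_list))
print(type(decoded["args"][2]).__name__, type(decoded["ret"]).__name__)
```

A consumer of the list form has to know that the third argument of socket(2) is the protocol, which is the built-in awareness of parameter lists mentioned above.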

Also, while I mentioned previously on the list that I'd probably be in favor of 
JSON-structured output, that was based on a fairly cursory knowledge of the 
format, basically just from having seen examples of it in lots of places.  It 
has since been pointed out though that it might not be such a great candidate 
-- for instance, with regard to point #1 above, JSON has the major disadvantage 
here (as mentioned by Elliott Hughes) of inheriting JavaScript's unfortunate 
"all numbers are doubles" brain-damage.  Also (as noted by Marc-Antoine Ruel), 
while JSON's inherent verbosity is certainly much less than, say, XML, it's 
still perhaps a bit "larger" than would be desirable.  (Though w.r.t. another 
aspect of Marc-Antoine's comment -- JSON doesn't necessarily have to be 
un-streamable, does it?  Couldn't you just leave the top-level structure of the 
output file as the concatenation of a bunch of discrete JSON objects, without 
wrapping them up in an array or similar?)
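
Both points can be sketched in a few lines of Python, under some assumptions: the record fields are invented for illustration, and the helper name iter_json_objects is hypothetical. Python's own json module keeps integers exact, so the second half passes parse_int=float to imitate a consumer that stores every number as a double.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value from a plain concatenation of values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode() does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Two discrete objects, one per syscall, with no enclosing array.
stream = (
    '{"syscall": "getpid", "args": [], "ret": 1234}\n'
    '{"syscall": "close", "args": [3], "ret": 0}\n'
)
records = list(iter_json_objects(stream))
print([r["syscall"] for r in records])

# A consumer that maps every number to a double cannot represent all
# 64-bit values: 2**53 + 1 silently becomes 2**53.
big = 2**53 + 1
as_double = json.loads('{"ret": %d}' % big, parse_int=float)["ret"]
print(big, int(as_double))
```

Because each object is complete on its own, a reader can process records as they arrive instead of waiting for the closing bracket of a top-level array.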

So I think it might be worth considering some possible alternatives to JSON...a 
few I'm vaguely aware of and/or have just done some brief research on now:

XML: ugly, bloated and verbose, unpopular with lots of people (myself 
included), just mentioning "because it's there", though I'd vote against it.

MessagePack (http://msgpack.org/):
 - more compact than JSON
 - binary, not text -- obviously less human-readable, but presumably for 
structured output we care more about ease of consumption by programs, not 
humans (and for programmatic use a binary format is significantly simpler than 
text, I'd say).  If human-readability is desired we'll still have the current 
output format available; I see no reason to try to optimize one output format 
for both purposes.
 - type system seems much better-suited for strace's purposes (has 64-bit ints, 
for one thing), and offers application-specific extensibility if needed.
 - not nearly as ubiquitous as JSON, but already has existing serdes 
implementations for lots of languages (https://github.com/msgpack)
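
As a rough illustration of the compactness and the 64-bit integer support, here is a hand-written Python encoder for a very small subset of MessagePack (small non-negative ints, uint 64, short strings, short arrays). It is a sketch for this discussion only: the function name pack and the sample record are assumptions, and real code would use one of the existing MessagePack libraries instead.

```python
import json
import struct

def pack(value):
    """Encode a tiny subset of MessagePack; illustrative, not a full encoder."""
    if isinstance(value, int) and 0 <= value <= 0x7F:
        return bytes([value])                       # positive fixint
    if isinstance(value, int) and 0 <= value < 2**64:
        # uint 64: type byte 0xcf, then 8 bytes big-endian. A real encoder
        # would pick the smallest integer width that fits the value.
        return b"\xcf" + struct.pack(">Q", value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) > 31:
            raise ValueError("only fixstr (up to 31 bytes) is handled here")
        return bytes([0xA0 | len(data)]) + data     # fixstr
    if isinstance(value, list):
        if len(value) > 15:
            raise ValueError("only fixarray (up to 15 items) is handled here")
        return bytes([0x90 | len(value)]) + b"".join(pack(v) for v in value)
    raise TypeError("unsupported type: %r" % type(value))

# Hypothetical record: syscall name plus an unlabeled argument array,
# including a value that needs all 64 bits.
record = ["write", [1, 2**64 - 1]]
packed = pack(record)
as_json = json.dumps(record)
print(len(packed), len(as_json))
```

The full-width integer is carried exactly as eight bytes, whereas the JSON text needs twenty decimal digits and leaves its precision up to the consumer's number type.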

BSON (http://bsonspec.org/):
 - similar to MessagePack in a lot of ways, I think, but has the property that 
in order to be well-formed and spec-compliant, a top-level document must be 
prefixed with a total-length descriptor, which seems like it would be a 
deal-breaker for strace (we'd have to be able to start streaming out the trace 
before we know how long it is).  That said, I suppose there's no reason strace 
couldn't just output a concatenation of smaller discrete BSON documents (as 
mentioned above with JSON).
 - type system: certainly a better fit for strace than JSON (has 64-bit ints), 
but seems generally a bit cruftier than MessagePack, with a lot of oddball bits 
and pieces thrown in (regexes, JS code, MD5s...none of which strace would need 
to use, but just seem like weird things to have).  Despite being a fairly young 
format, already has a bunch of features marked "old" or "deprecated", which to me 
(at least superficially) gives it the appearance of maybe being not all that 
well-designed.
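
The length-prefix point can be sketched in Python as follows. This hand-rolled encoder handles only int64 fields and exists purely to show the framing; the helper names and the sample fields are assumptions, and real code would use an existing BSON library.

```python
import struct

def bson_int64_doc(fields):
    """Encode a dict of name -> int as a BSON document of int64 elements."""
    body = b""
    for name, value in fields.items():
        # Element: type byte 0x12 (int64), NUL-terminated field name,
        # then the value as 8 bytes little-endian.
        body += b"\x12" + name.encode("utf-8") + b"\x00" + struct.pack("<q", value)
    # The int32 total length counts the 4-byte prefix itself and the
    # trailing NUL that terminates the document.
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

def split_documents(stream):
    """Split a concatenation of BSON documents using each length prefix."""
    docs = []
    pos = 0
    while pos < len(stream):
        (length,) = struct.unpack_from("<i", stream, pos)
        docs.append(stream[pos:pos + length])
        pos += length
    return docs

# One small document per syscall: each length is known as soon as that
# record is complete, so nothing depends on the length of the whole trace.
stream = bson_int64_doc({"fd": 3, "ret": 0}) + bson_int64_doc({"ret": 2**62})
docs = split_documents(stream)
print([len(d) for d in docs])
```

The per-document prefix also gives the reader a cheap way to skip records it does not care about without decoding them.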


So, given all that, I think MessagePack is actually looking fairly appealing, 
personally.


Thanks,
Zev

