Hi Russel, thanks for pointing out the python lib. I created a little converter script that reads in json using json.loads and writes the resulting object to avro using a specific schema. (Or does the lib already contain such a converter and I just missed it?)
Thanks for the help, -markus Am 19.09.2012 um 01:18 schrieb Russell Jurney: > Fwiw, I do this in web apps all the time via the python avro lib and > json.dumps > > Russell Jurney > twitter.com/rjurney > russell.jur...@gmail.com > datasyndrome.com > > On Sep 18, 2012, at 12:38 PM, Doug Cutting <cutt...@apache.org> wrote: > >> On Tue, Sep 18, 2012 at 11:34 AM, Markus Strickler <mar...@braindump.ms> >> wrote: >>> Json.Writer is indeed what I had in mind and I have successfully managed to >>> convert my existing JSON to avro using it. >>> However using GenericDatumReader on this feels pretty unnatural, as I seem >>> to be unable to access fields directly. It seems I have to access the >>> "value" field on each record which returns a Map which uses Utf8 Objects as >>> keys for the actual fields. Or am I doing something wrong here? >> >> Hmm. We could re-factor Json.SCHEMA so the union is the top-level >> element. That would get rid of the wrapper around every value. It's >> a more redundant way to write the schema, but the binary encoding is >> identical (since a record wrapper adds no bytes). It would hence >> require no changes to Json.Reader or Json.Writer. >> >> [ "long", >> "double", >> "string", >> "boolean", >> "null", >> {"type" : "array", >> "items" : { >> "type" : "record", >> "name" : "org.apache.avro.data.Json", >> "fields" : [ { >> "name" : "value", >> "type" : [ "long", "double", "string", "boolean", "null", >> {"type" : "array", "items" : "Json"}, >> {"type" : "map", "values" : "Json"} >> ] >> } ] >> } >> }, >> {"type" : "map", "values" : "Json"} >> ] >> >> You can try this by placing this schema in >> share/schemas/org/apache/avro/data/Json.avsc and re-building the avro >> jar. >> >> Would such a change be useful to you? If so, please file an issue in Jira. >> >> Or we could even refactor this schema so that a Json object is the >> top-level structure: >> >> {"type" : "map", >> "values" : [ "long", >> "double", >> "string", >> "boolean", >> "null", >> {"type" : "array", >> "items" : { >> "type" : "record", >> "name" : "org.apache.avro.data.Json", >> "fields" : [ { >> "name" : "value", >> "type" : [ "long", "double", "string", "boolean", >> "null", >> {"type" : "array", "items" : "Json"}, >> {"type" : "map", "values" : "Json"} >> ] >> } ] >> } >> }, >> {"type" : "map", "values" : "Json"} >> ] >> } >> >> This would change the binary format but would not change the >> representation that GenericDatumReader would hand you from my first >> example above (since the generic representation unwraps unions). >> Using this schema would require changes to Json.Writer and >> Json.Reader. It would better conform to the definition of Json, which >> only permits objects as the top-level type. >> >>> Concerning the more specific schema, you are of course completely right. >>> Unfortunately more or less all the fields in the JSON data format are >>> optional and many have substructures, so, at least in my understanding, I >>> have to use unions of null and the actual type throughout the schema. I >>> tried using JsonDecoder first (or rather the fromjson option of the avro >>> tool, which, I think, uses JsonDecoder) but given the current JSON >>> structures, this didn't work. >> >>> So I'll probably have to look into implementing my own converter. However >>> given the rather complex structure of the original JSON I'm wondering if >>> trying to represent the data in avro is such a good idea in the first place. >> >> It would be interesting to see whether, with the appropriate schema, >> whether the dataset is smaller and faster to process as Avro than as >> Json. If you have 1000 fields in your data but the typical record >> only has one or two non-null, then an Avro record is perhaps not a >> good representation. An Avro map might be better, but if the values >> are similarly variable then Json might be competitive. >> >> Cheers, >> >> Doug