> I want to be able to read from an Avro formatted log file (specifically the
> History Log file created at the end of a Hadoop job) and create a Comma
> Separated file of certain log entries. I need a csv file because this is the
> format that is accepted by post processing software I am working with (e.g.
> Matlab).
>
> Initially I was using a BASH script to grep and awk from this file and create
> my CSV file because I needed very few values from it, and a quick script
> just worked. I didn't try to get to know what format the log file was in and
> utilize that. (my bad!) Now that I need to be scaling up and want to have a
> reliable way to parse, I would like to try and do it the right way.
>
> My question is this: For the above goal, could you please guide me with steps
> I can follow - such as reading material and libraries I could try to use. As
> I go through the Quick Start Guide and FAQ, I see that a lot of the
> information here is geared to someone who wants to use the data serialization
> and RPC functionality provided by Avro. Given that I only want to be able to
> "read", where may I start?
>
> I can comfortably script with BASH and Perl. Given that I only see support
> for Java, Python and Ruby, I think I can take this as an opportunity to learn
> Python and get up to speed.

You could also take a look at the C bindings. We've recently added a couple
of command-line tools for outputting the contents of an Avro file to stdout:
avrocat and avropipe.

avrocat outputs each record in an Avro file on a single line, using the JSON
encoding defined by the Avro spec [1]. avropipe produces a separate line for
each "field" in each record; its output is (roughly speaking) what you'd get
from piping the JSON encoding of each record through the jsonpipe [2] tool.
(Technically speaking, it's what you get from putting all of the records into
a JSON array, and sending that array through jsonpipe.)

[1] http://avro.apache.org/docs/current/spec.html#json_encoding
[2] https://github.com/dvxhouse/jsonpipe

So, with the example quickstop.db file, avrocat gives you:

    $ avrocat examples/quickstop.db | head
    {"ID": 1, "First": "Dante", "Last": "Hicks", "Phone": "(0)", "Age": 32}
    {"ID": 2, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}
    {"ID": 3, "First": "Veronica", "Last": "Loughran", "Phone": "(555) 123-0987", "Age": 28}
    {"ID": 4, "First": "Caitlin", "Last": "Bree", "Phone": "(555) 123-2323", "Age": 27}
    {"ID": 5, "First": "Bob", "Last": "Silent", "Phone": "(555) 123-6422", "Age": 29}
    {"ID": 6, "First": "Jay", "Last": "???", "Phone": "(0)", "Age": 26}
    {"ID": 7, "First": "Dante", "Last": "Hicks", "Phone": "(1)", "Age": 32}
    {"ID": 8, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}
    {"ID": 9, "First": "Veronica", "Last": "Loughran", "Phone": "(555) 123-0987", "Age": 28}
    {"ID": 10, "First": "Caitlin", "Last": "Bree", "Phone": "(555) 123-2323", "Age": 27}

While avropipe gives you:

    $ avropipe examples/quickstop.db | head -n 25
    /           []
    /0          {}
    /0/ID       1
    /0/First    "Dante\u0000"
    /0/Last     "Hicks\u0000"
    /0/Phone    "(0)\u0000"
    /0/Age      32
    /1          {}
    /1/ID       2
    /1/First    "Randal\u0000"
    /1/Last     "Graves\u0000"
    /1/Phone    "(555) 123-5678\u0000"
    /1/Age      30
    /2          {}
    /2/ID       3
    /2/First    "Veronica\u0000"
    /2/Last     "Loughran\u0000"
    /2/Phone    "(555) 123-0987\u0000"
    /2/Age      28
    /3          {}
    /3/ID       4
    /3/First    "Caitlin\u0000"
    /3/Last     "Bree\u0000"
    /3/Phone    "(555) 123-2323\u0000"
    /3/Age      27

Although I'm seeing a bug there, since those NUL terminators shouldn't appear
in the output. I'm going to open a ticket for that and fix it real quick.

But these tools might be exactly what you need, especially since the C
bindings don't have any library dependencies to install.

cheers
–doug
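
P.S. Since the end goal is a CSV file and Python was mentioned, here is a minimal sketch of how the one-record-per-line output of avrocat could be turned into CSV. It is only an illustration, not part of the Avro tools: the script name avro2csv.py and the function json_lines_to_csv are made up for this example, and it assumes flat records like the ones in quickstop.db. Nested values, such as those a Hadoop history log is likely to contain, would need to be flattened first.

```python
import csv
import json
import sys


def json_lines_to_csv(lines, out, fields):
    """Write a header plus one CSV row per JSON record, keeping only `fields`.

    `lines` is any iterable of strings holding one JSON object each, which is
    the shape of avrocat's output. Assumes flat records; a field missing from
    a record is written as an empty cell.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        writer.writerow([record.get(name, "") for name in fields])


# Hypothetical command-line use (script name is an assumption):
#   avrocat examples/quickstop.db | python avro2csv.py ID First Age > out.csv
if __name__ == "__main__" and len(sys.argv) > 1:
    json_lines_to_csv(sys.stdin, sys.stdout, sys.argv[1:])
```

The field names are passed on the command line so the same script can pull a different subset of columns without editing it; the csv module takes care of quoting values such as phone numbers that contain commas or quotes.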