> I want to be able to read from an Avro formatted log file (specifically the 
> History Log file created at the end of a Hadoop job) and create a Comma 
> Separated file of certain log entries. I need a csv file because this is the 
> format that is accepted by post processing software I am working with (eg: 
> Matlab).
> 
> Initially I was using a BASH script to grep and awk from this file and create 
> my CSV file because I needed very few values from it, and a quick script 
> just worked. I didn't try to get to know what format the log file was in and 
> utilize that. (my bad!)  Now that I need to scale up and want to have a 
> reliable way to parse, I would like to try and do it the right way. 
> 
> My question is this: For the above goal, could you please guide me with steps 
> I can follow - such as reading material and libraries I could try to use. As 
> I go through the Quick Start Guide and FAQ, I see that a lot of the 
> information here is geared to someone who wants to use the data serialization 
> and RPC functionality provided by Avro. Given that I only want to be able to 
> "read", where may I start?
> 
> I can comfortably script with BASH and Perl. Given that I only see support 
> for Java, Python and Ruby, I think I can take this as an opportunity to learn 
> Python and get up to speed. 

You could also take a look at the C bindings.  We've recently added a couple of 
command-line tools for outputting the contents of an Avro file to stdout: 
avrocat and avropipe.  avrocat outputs each record in an Avro file on a single 
line, using the JSON encoding defined by the Avro spec [1].  avropipe produces 
a separate line for each “field” in each record; its output is (roughly 
speaking) what you'd get from piping the JSON encoding of each record through 
the jsonpipe [2] tool.  (Technically speaking, it's what you get from putting 
all of the records into a JSON array, and sending that array through jsonpipe.)

[1] http://avro.apache.org/docs/current/spec.html#json_encoding
[2] https://github.com/dvxhouse/jsonpipe

So, with the example quickstop.db file, avrocat gives you:

  $ avrocat examples/quickstop.db | head
  {"ID": 1, "First": "Dante", "Last": "Hicks", "Phone": "(0)", "Age": 32}
  {"ID": 2, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}
  {"ID": 3, "First": "Veronica", "Last": "Loughran", "Phone": "(555) 123-0987", "Age": 28}
  {"ID": 4, "First": "Caitlin", "Last": "Bree", "Phone": "(555) 123-2323", "Age": 27}
  {"ID": 5, "First": "Bob", "Last": "Silent", "Phone": "(555) 123-6422", "Age": 29}
  {"ID": 6, "First": "Jay", "Last": "???", "Phone": "(0)", "Age": 26}
  {"ID": 7, "First": "Dante", "Last": "Hicks", "Phone": "(1)", "Age": 32}
  {"ID": 8, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}
  {"ID": 9, "First": "Veronica", "Last": "Loughran", "Phone": "(555) 123-0987", "Age": 28}
  {"ID": 10, "First": "Caitlin", "Last": "Bree", "Phone": "(555) 123-2323", "Age": 27}
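Since avrocat emits one JSON-encoded record per line, converting its output to
CSV needs nothing beyond Python's standard library — which could be a gentle
first Python exercise.  A minimal sketch (the function name and field list are
just illustrative; substitute the fields from your own job-history schema):

```python
import csv
import io
import json

def json_lines_to_csv(lines, fields):
    """Turn avrocat-style output (one JSON object per line) into CSV
    text containing only the requested fields, with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(fields)
    for line in lines:
        record = json.loads(line)
        writer.writerow(record[f] for f in fields)
    return buf.getvalue()

# Typical use, piping avrocat into this script:
#   avrocat examples/quickstop.db | python avro2csv.py > out.csv
# where avro2csv.py does something like:
#   import sys
#   sys.stdout.write(json_lines_to_csv(sys.stdin, ["ID", "Last", "Age"]))
```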

While avropipe gives you:

  $ avropipe examples/quickstop.db | head -n 25
  /     []
  /0    {}
  /0/ID 1
  /0/First      "Dante\u0000"
  /0/Last       "Hicks\u0000"
  /0/Phone      "(0)\u0000"
  /0/Age        32
  /1    {}
  /1/ID 2
  /1/First      "Randal\u0000"
  /1/Last       "Graves\u0000"
  /1/Phone      "(555) 123-5678\u0000"
  /1/Age        30
  /2    {}
  /2/ID 3
  /2/First      "Veronica\u0000"
  /2/Last       "Loughran\u0000"
  /2/Phone      "(555) 123-0987\u0000"
  /2/Age        28
  /3    {}
  /3/ID 4
  /3/First      "Caitlin\u0000"
  /3/Last       "Bree\u0000"
  /3/Phone      "(555) 123-2323\u0000"
  /3/Age        27
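The avropipe format is also easy to consume programmatically: each line is a
slash-separated path followed by a JSON-encoded value.  A sketch of a parser
(assuming whitespace separates path and value, and skipping the bare {} / []
container markers; the function name is just for illustration):

```python
import json

def parse_avropipe(lines):
    """Parse avropipe-style lines ("<path> <json-value>") into
    (path, value) pairs, skipping the {} / [] container markers."""
    for line in lines:
        parts = line.split(None, 1)  # the path contains no whitespace
        if len(parts) != 2:
            continue
        path, raw = parts
        if raw.strip() in ("{}", "[]"):
            continue
        yield path, json.loads(raw)
```

From there, a grep-style filter — say, keeping only paths that end in
/Phone — is a one-liner.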

Although I'm seeing a bug there, since those NUL terminators shouldn't appear 
in the output.  I'm going to open a ticket for that and fix it real quick.  
But, these tools might be exactly what you need, especially since the C 
bindings don't have any library dependencies to install.

cheers
–doug
