[ https://issues.apache.org/jira/browse/MAPREDUCE-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742921#action_12742921 ]

Doug Cutting commented on MAPREDUCE-157:
----------------------------------------

Owen> Of course reading is the reverse. It would be like writing xml files by 
generating the necessary DOM objects.

Not sure what you mean.  Jackson has an event-based JSON reading API.

http://jackson.codehaus.org/1.2.0/javadoc/org/codehaus/jackson/JsonParser.html

So, to efficiently read things back into structs you might use an enum of field 
names, e.g.:
{code}
import java.io.IOException;
import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonToken;

class Foo { int a; String b; }
enum FooFields { A, B }

void readFoo(JsonParser parser, Foo foo) throws IOException {
  if (parser.nextToken() != JsonToken.START_OBJECT)
    throw new IOException("expected start of object");
  while (parser.nextToken() != JsonToken.END_OBJECT) {   // current token is a field name
    parser.nextToken();                                   // advance to that field's value
    switch (Enum.valueOf(FooFields.class, parser.getCurrentName().toUpperCase())) {
    case A: foo.a = parser.getIntValue(); break;
    case B: foo.b = parser.getText(); break;
    }
  }
}
{code}
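
For comparison, the write side is also event-based, so no DOM-style tree of objects needs to be built. A rough sketch, assuming the same Foo as above:
{code}
import java.io.IOException;
import java.io.Writer;
import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonGenerator;

void writeFoo(Writer out, Foo foo) throws IOException {
  JsonGenerator gen = new JsonFactory().createJsonGenerator(out);
  gen.writeStartObject();              // {
  gen.writeNumberField("a", foo.a);    //   "a": <int, unquoted>
  gen.writeStringField("b", foo.b);    //   "b": "<text>"
  gen.writeEndObject();                // }
  gen.flush();
}
{code}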

FWIW, Avro supports SAX-like streaming, without object creation.  A significant 
change if we used Avro would be that we'd need to store the schema with the 
data.  We could, for example, make the first line of log files the schema, or 
write a side file, but there's not much point to Avro data without storing a 
schema.
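
To make that concrete, for the Foo example above the first line of the file (or the side file) might carry something like this record schema (just a sketch):
{code}
{"type": "record", "name": "Foo", "fields": [
  {"name": "a", "type": "int"},
  {"name": "b", "type": "string"}
]}
{code}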

Is the implicit schema proposed here Map<String,String>?  For example, would 
integer values be written as JSON strings, with quotes, or as JSON integers, 
without quotes?  If the schema is Map<String,String> and will be for all time, 
then there's less point to using Avro.  But if fields are typed it might be 
nice to record the types in a schema.
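
For example, the same Foo record would come out differently under the two interpretations (illustrative only):
{code}
Map<String,String> (all values quoted):   {"a": "1", "b": "x"}
Typed fields (integer unquoted):          {"a": 1, "b": "x"}
{code}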

> Job History log file format is not friendly for external tools.
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-157
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-157
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>            Reporter: Owen O'Malley
>            Assignee: Jothi Padmanabhan
>
> Currently, parsing the job history logs with external tools is very difficult 
> because of the format. The most critical problem is that newlines aren't 
> escaped in the strings. That makes using tools like grep, sed, and awk very 
> tricky.

