The fix was this: 

{
    "type":"record",
    "name":"Email",
    "fields":
    [
        {
            "name":"message_id",
            "type":["null","string"],
            "doc":""
        },
        {
            "name":"in_reply_to",
            "type": ["string", "null"]
        },
        {
            "name":"subject", 
            "type": ["string", "null"]
        },
        {
            "name":"body", 
            "type": ["string", "null"]
        },
        {
            "name":"date", 
            "type": ["string", "null"]
        },
        {
            "name":"froms",
            "type":
            [
                "null",
                {
                    "type":"array",
                    "items":
                    [
                        "null",
                        {
                            "type":"record",
                            "name":"from",
                            "fields":
                            [
                                {
                                    "name":"real_name",
                                    "type":["null","string"],
                                    "doc":""
                                },
                                {
                                    "name":"address",
                                    "type":["null","string"],
                                    "doc":""
                                }
                            ]
                        }
                    
                    ]
                }
            ],
            "doc":""
        },
        {
            "name":"tos",
            "type":
            [
                "null",
                {
                    "type":"array",
                    "items":
                    [
                        "null",
                        {
                            "type":"record",
                            "name":"to",
                            "fields":
                            [
                                {
                                    "name":"real_name",
                                    "type":["null","string"],
                                    "doc":""
                                },
                                {
                                    "name":"address",
                                    "type":["null","string"],
                                    "doc":""
                                }
                            ]
                        }
                    
                    ]
                }
            ],
            "doc":""
        },        
        {
            "name":"ccs",
            "type":
            [
                "null",
                {
                    "type":"array",
                    "items":
                    [
                        "null",
                        {
                            "type":"record",
                            "name":"cc",
                            "fields":
                            [
                                {
                                    "name":"real_name",
                                    "type":["null","string"],
                                    "doc":""
                                },
                                {
                                    "name":"address",
                                    "type":["null","string"],
                                    "doc":""
                                }
                            ]
                        }
                    
                    ]
                }
            ],
            "doc":""
        },
        {
            "name":"bccs",
            "type":
            [
                "null",
                {
                    "type":"array",
                    "items":
                    [
                        "null",
                        {
                            "type":"record",
                            "name":"bcc",
                            "fields":
                            [
                                {
                                    "name":"real_name",
                                    "type":["null","string"],
                                    "doc":""
                                },
                                {
                                    "name":"address",
                                    "type":["null","string"],
                                    "doc":""
                                }
                            ]
                        }
                    
                    ]
                }
            ],
            "doc":""
        },
        {
            "name":"reply_tos",
            "type":
            [
                "null",
                {
                    "type":"array",
                    "items":
                    [
                        "null",
                        {
                            "type":"record",
                            "name":"reply_to",
                            "fields":
                            [
                                {
                                    "name":"real_name",
                                    "type":["null","string"],
                                    "doc":""
                                },
                                {
                                    "name":"address",
                                    "type":["null","string"],
                                    "doc":""
                                }
                            ]
                        }
                    
                    ]
                }
            ],
            "doc":""
        }
    ]
}
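
Since the five address fields above differ only in their names, the same schema can be generated rather than hand-written. What follows is a minimal sketch in Python using only the standard library; the helper names (nullable_string_field, address_list_field, build_email_schema) are made up for illustration and are not part of Avro or Pig.

```python
import json


def nullable_string_field(name):
    """A field that holds either null or a string, as in the schema above."""
    return {"name": name, "type": ["null", "string"], "doc": ""}


def address_list_field(plural, singular):
    """A nullable array of nullable named records, e.g. froms -> from."""
    address_record = {
        "type": "record",
        "name": singular,
        "fields": [
            nullable_string_field("real_name"),
            nullable_string_field("address"),
        ],
    }
    return {
        "name": plural,
        "type": ["null", {"type": "array", "items": ["null", address_record]}],
        "doc": "",
    }


def build_email_schema():
    """Assemble the Email record with the same fields as the schema above."""
    fields = [nullable_string_field("message_id")]
    for name in ("in_reply_to", "subject", "body", "date"):
        fields.append({"name": name, "type": ["string", "null"]})
    pairs = (
        ("froms", "from"),
        ("tos", "to"),
        ("ccs", "cc"),
        ("bccs", "bcc"),
        ("reply_tos", "reply_to"),
    )
    for plural, singular in pairs:
        fields.append(address_list_field(plural, singular))
    return {"type": "record", "name": "Email", "fields": fields}


# Dump the generated schema as JSON, ready to save to a schema file.
print(json.dumps(build_email_schema(), indent=4))
```

The point of the design is that each array element is a named record (from, to, cc, ...), so the pluralized field carries the array and the record carries the element name.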

On Tue, Apr 10, 2012 at 2:36 AM, Russell Jurney <russell.jur...@gmail.com> 
wrote:
Hmmmm unable to get this to work:

{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name":"message_id", "type": ["string", "null"]},
        {"name":"froms","type": [{"type":"record", "name":"from", "fields": 
[{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"tos","type": [{"type":"record", "name":"to", "fields": 
[{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"ccs","type": [{"type":"record", "name":"cc", "fields": 
[{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"bccs","type": [{"type":"record", "name":"bcc", "fields": 
[{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"reply_tos","type": [{"type":"record", "name":"reply_to", 
"fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
        {"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, 
"null"]},
        {"name":"subject", "type": ["string", "null"]},
        {"name":"body", "type": ["string", "null"]},
        {"name":"date", "type": ["string", "null"]}
    ]
}
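
One likely reason the schema above is rejected: per the Avro specification, every entry in a record's "fields" list must be a JSON object with both a "name" and a "type", whereas here the entries are a bare array type and the string "null". The sketch below is a hypothetical checker (find_bad_fields is not an Avro API) that walks a schema and reports such entries; it covers only this one rule and is not a full validator.

```python
def find_bad_fields(schema, path=""):
    """Return (record_path, entry) pairs for record field entries that are
    not objects carrying both "name" and "type"."""
    problems = []
    if isinstance(schema, list):
        # A union: check every branch.
        for branch in schema:
            problems += find_bad_fields(branch, path)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            name = schema.get("name", "?")
            here = path + "/" + name if path else name
            for entry in schema.get("fields", []):
                ok = isinstance(entry, dict) and "name" in entry and "type" in entry
                if ok:
                    problems += find_bad_fields(entry["type"], here + "." + entry["name"])
                else:
                    problems.append((here, entry))
        elif kind == "array":
            problems += find_bad_fields(schema.get("items"), path)
        elif isinstance(kind, (dict, list)):
            # A nested type definition, e.g. {"type": {...}}.
            problems += find_bad_fields(kind, path)
    return problems


# A cut-down version of the failing schema: only message_id and froms.
broken = {
    "name": "Email",
    "type": "record",
    "fields": [
        {"name": "message_id", "type": ["string", "null"]},
        {"name": "froms", "type": [
            {"type": "record", "name": "from",
             "fields": [{"type": "array", "items": "string"}, "null"]},
            "null",
        ]},
    ],
}

for where, entry in find_bad_fields(broken):
    print(where, "has a field entry without name/type:", entry)
```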

On Tue, Apr 10, 2012 at 2:26 AM, Russell Jurney <russell.jur...@gmail.com> 
wrote:
In thinking about it more... it seems that unfortunately, the only thing I can 
really do is to change the schema for all email address fields:

{"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
to:
{"name":"froms","type": [{"type":"record", "name":"from", "fields": 
[{"type":"array", "items":"string"}, "null"]}, "null"]},

That is, to pluralize everything and then individually name array elements. I 
will try running this through my stack.


On Mon, Apr 2, 2012 at 9:13 AM, Scott Carey <scottca...@apache.org> wrote:
It appears as though the Avro to PigStorage schema translation names (in Pig) 
all array elements ARRAY_ELEM.  The nullable wrapper is 'visible' and the field 
name is not moved onto the bag name.

About a year and a half ago I started
https://issues.apache.org/jira/browse/AVRO-592

but before finishing it AvroStorage was written elsewhere.  I don't recall 
exactly what I did with the schema translation there, but I recall the mapping 
from an Avro schema to pig tried to hide the nullable wrappers more.


In Avro, arrays are unnamed types, so I see two things you could probably do 
without any code changes:

* Add a line in the pig script to project / rename the fields to what you want 
(unfortunate and clumsy, but I think it will work). I think you want 
"from::PIG_WRAPPER::ARRAY_ELEM as from" or 
"FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from", or something like that.
* Add a record wrapper to your schema (which may inject more messiness in the 
pig schema view):
{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name":"message_id", "type": ["string", "null"]},
        {"name":"from","type": [{"type":"record", "name":"From", "fields": 
[[{"type":"array", "items":"string"},"null"]], "null"]},
       …
    ]
}

But that is very awkward: it requires a named record for each field that is an 
unnamed type.


Ideally PigStorage would treat any union of null and one other thing as a 
simple pig type with no wrapper, and project the name of a field or record into 
the name of the thing inside a bag.
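
To make that idea concrete, here is a rough sketch in Python. It is not how AvroStorage is implemented; unwrap_nullable and describe_field are hypothetical names, and the output keeps Avro type names rather than mapping them to Pig types. It only handles arrays of primitive items.

```python
def unwrap_nullable(avro_type):
    """Return (inner_type, is_nullable).

    A union of "null" and exactly one other branch collapses to that
    branch; anything else is returned unchanged.
    """
    if isinstance(avro_type, list):
        non_null = [branch for branch in avro_type if branch != "null"]
        if len(non_null) == 1 and len(non_null) < len(avro_type):
            return non_null[0], True
    return avro_type, False


def describe_field(field):
    """Render one record field with the null wrapper hidden and, for
    arrays, the field name reused as the element name."""
    inner, _ = unwrap_nullable(field["type"])
    if isinstance(inner, dict) and inner.get("type") == "array":
        item, _ = unwrap_nullable(inner["items"])
        # Project the field name onto the element instead of ARRAY_ELEM.
        return "%s: {(%s: %s)}" % (field["name"], field["name"], item)
    return "%s: %s" % (field["name"], inner)


print(describe_field({"name": "from",
                      "type": [{"type": "array", "items": "string"}, "null"]}))
print(describe_field({"name": "subject", "type": ["string", "null"]}))
```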


-Scott

On 3/29/12 6:05 PM, "Russell Jurney" <russell.jur...@gmail.com> wrote:

Is it possible to name string elements in the schema of an array?  
Specifically, below I want to name the email addresses in the 
from/to/cc/bcc/reply_to fields, so they don't get auto-named ARRAY_ELEM by 
Pig's AvroStorage.  I know I can probably fix this in Java in the Pig 
AvroStorage UDF, but I'm hoping I can also fix it more easily in the schema.  
Last time I read Avro's array docs in this context, my hit-points dropped by a 
third, so pardon me if I've not rtfm this time :)

Complete description of what I'm doing follows:

Avro schema for my emails:

{
    "namespace": "agile.data.avro",
    "name": "Email",
    "type": "record",
    "fields": [
        {"name":"message_id", "type": ["string", "null"]},
        {"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"to","type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"cc","type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"bcc","type": [{"type":"array", "items":"string"}, "null"]},
        {"name":"reply_to", "type": [{"type":"array", "items":"string"}, 
"null"]},
        {"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, 
"null"]},
        {"name":"subject", "type": ["string", "null"]},
        {"name":"body", "type": ["string", "null"]},
        {"name":"date", "type": ["string", "null"]}
    ]
}

Pig to publish my Avros:

grunt> emails = load '/me/tmp/emails' using AvroStorage();
grunt> describe emails

emails: {message_id: chararray,from: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},to: 
{PIG_WRAPPER: (ARRAY_ELEM: chararray)},cc: {PIG_WRAPPER: (ARRAY_ELEM: 
chararray)},bcc: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},reply_to: {PIG_WRAPPER: 
(ARRAY_ELEM: chararray)},in_reply_to: {PIG_WRAPPER: (ARRAY_ELEM: 
chararray)},subject: chararray,body: chararray,date: chararray}

grunt> store emails into 'mongodb://localhost/agile_data.emails' using 
MongoStorage();

My emails in MongoDB:

> db.emails.findOne()
{
        "_id" : ObjectId("4f738a35414e113e75707b97"),
        "message_id" : "<4f71abddc19ec_145449e389847...@li169-134.mail>",
        "from" : [
                {
                        "ARRAY_ELEM" : "da...@jobchangealerts.com"
                }
        ],
        "to" : [
                {
                        "ARRAY_ELEM" : "russell.jur...@gmail.com"
                }
        ],
        "cc" : null,
        "bcc" : null,
        "reply_to" : null,
        "in_reply_to" : null,
        "subject" : "Daily Job Change Alerts from SalesLoft",
        "body" : "Daily Job Change Alerts from SalesLoft",
        "date" : "2012-03-27T08:00:29"
}
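
For reference, the extra presentation code this shape forces looks roughly like the sketch below: a small Python function that unwraps the single-key ARRAY_ELEM objects once a document has been read from MongoDB. The function name strip_array_elem and the example.com addresses are made up for illustration.

```python
def strip_array_elem(document, wrapper_key="ARRAY_ELEM"):
    """Return a copy of the document in which every list item of the form
    {"ARRAY_ELEM": value} is replaced by the bare value."""
    cleaned = {}
    for key, value in document.items():
        if isinstance(value, list):
            cleaned[key] = [
                item[wrapper_key]
                if isinstance(item, dict) and set(item) == {wrapper_key}
                else item
                for item in value
            ]
        else:
            # Scalars and nulls (None) pass through untouched.
            cleaned[key] = value
    return cleaned


# Same shape as the document above, with placeholder addresses.
email = {
    "from": [{"ARRAY_ELEM": "alice@example.com"}],
    "to": [{"ARRAY_ELEM": "bob@example.com"}],
    "cc": None,
    "subject": "Daily Job Change Alerts from SalesLoft",
}

print(strip_array_elem(email))
```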

My email on screen:



My face when I see ARRAY_ELEM, because it means more complex presentation code: 
:(
-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


