The fix was this:

{
  "type": "record",
  "name": "Email",
  "fields": [
    { "name": "message_id", "type": ["null", "string"], "doc": "" },
    { "name": "in_reply_to", "type": ["string", "null"] },
    { "name": "subject", "type": ["string", "null"] },
    { "name": "body", "type": ["string", "null"] },
    { "name": "date", "type": ["string", "null"] },
    { "name": "froms",
      "type": [ "null",
        { "type": "array",
          "items": [ "null",
            { "type": "record", "name": "from",
              "fields": [
                { "name": "real_name", "type": ["null", "string"], "doc": "" },
                { "name": "address", "type": ["null", "string"], "doc": "" }
              ] } ] } ],
      "doc": "" },
    { "name": "tos",
      "type": [ "null",
        { "type": "array",
          "items": [ "null",
            { "type": "record", "name": "to",
              "fields": [
                { "name": "real_name", "type": ["null", "string"], "doc": "" },
                { "name": "address", "type": ["null", "string"], "doc": "" }
              ] } ] } ],
      "doc": "" },
    { "name": "ccs",
      "type": [ "null",
        { "type": "array",
          "items": [ "null",
            { "type": "record", "name": "cc",
              "fields": [
                { "name": "real_name", "type": ["null", "string"], "doc": "" },
                { "name": "address", "type": ["null", "string"], "doc": "" }
              ] } ] } ],
      "doc": "" },
    { "name": "bccs",
      "type": [ "null",
        { "type": "array",
          "items": [ "null",
            { "type": "record", "name": "bcc",
              "fields": [
                { "name": "real_name", "type": ["null", "string"], "doc": "" },
                { "name": "address", "type": ["null", "string"], "doc": "" }
              ] } ] } ],
      "doc": "" },
    { "name": "reply_tos",
      "type": [ "null",
        { "type": "array",
          "items": [ "null",
            { "type": "record", "name": "reply_to",
              "fields": [
                { "name": "real_name", "type": ["null", "string"], "doc": "" },
                { "name": "address", "type": ["null", "string"], "doc": "" }
              ] } ] } ],
      "doc": "" }
  ]
}
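The five address-list fields in the schema above all share one shape: a nullable array of nullable named records with real_name and address. As a rough sketch only, the same JSON could be generated rather than typed out by hand. The helper names below (nullable_string, address_list_field, email_schema) are hypothetical and not from this thread, and the code only builds the JSON with Python's standard library; it does not validate anything against an Avro implementation.

```python
import json


def nullable_string(name):
    """A nullable string field with an empty doc, as used in the schema above."""
    return {"name": name, "type": ["null", "string"], "doc": ""}


def address_list_field(plural, singular):
    """A nullable array of nullable named records, e.g. froms -> from."""
    record = {
        "type": "record",
        "name": singular,
        "fields": [nullable_string("real_name"), nullable_string("address")],
    }
    return {
        "name": plural,
        "type": ["null", {"type": "array", "items": ["null", record]}],
        "doc": "",
    }


def email_schema():
    """Assemble the Email record in the same field order as the schema above."""
    fields = [nullable_string("message_id")]
    fields += [
        {"name": name, "type": ["string", "null"]}
        for name in ("in_reply_to", "subject", "body", "date")
    ]
    fields += [
        address_list_field(plural, singular)
        for plural, singular in (
            ("froms", "from"),
            ("tos", "to"),
            ("ccs", "cc"),
            ("bccs", "bcc"),
            ("reply_tos", "reply_to"),
        )
    ]
    return {"type": "record", "name": "Email", "fields": fields}


if __name__ == "__main__":
    print(json.dumps(email_schema(), indent=2))
```

Generating the fields keeps the five record definitions from drifting apart, at the cost of one more step before the schema file exists.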
On Tue, Apr 10, 2012 at 2:36 AM, Russell Jurney <russell.jur...@gmail.com> wrote:

Hmmmm unable to get this to work:

{
  "namespace": "agile.data.avro",
  "name": "Email",
  "type": "record",
  "fields": [
    {"name":"message_id", "type": ["string", "null"]},
    {"name":"froms","type": [{"type":"record", "name":"from", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
    {"name":"tos","type": [{"type":"record", "name":"to", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
    {"name":"ccs","type": [{"type":"record", "name":"cc", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
    {"name":"bccs","type": [{"type":"record", "name":"bcc", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
    {"name":"reply_tos","type": [{"type":"record", "name":"reply_to", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
    {"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"subject", "type": ["string", "null"]},
    {"name":"body", "type": ["string", "null"]},
    {"name":"date", "type": ["string", "null"]}
  ]
}

On Tue, Apr 10, 2012 at 2:26 AM, Russell Jurney <russell.jur...@gmail.com> wrote:

In thinking about it more... it seems that unfortunately, the only thing I can really do is to change the schema for all email address fields:

{"name":"from","type": [{"type":"array", "items":"string"}, "null"]},

to:

{"name":"froms","type": [{"type":"record", "name":"from", "fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},

That is, to pluralize everything and then individually name array elements. I will try running this through my stack.

On Mon, Apr 2, 2012 at 9:13 AM, Scott Carey <scottca...@apache.org> wrote:

It appears as though the Avro to PigStorage schema translation names (in pig) all arrays ARRAY_ELEM. The nullable wrapper is 'visible' and the field name is not moved onto the bag name.
About a year and a half ago I started https://issues.apache.org/jira/browse/AVRO-592 but before finishing it AvroStorage was written elsewhere. I don't recall exactly what I did with the schema translation there, but I recall the mapping from an Avro schema to pig tried to hide the nullable wrappers more.

In Avro, arrays are unnamed types, so I see two things you could probably do without any code changes:

* Add a line in the pig script to project / rename the fields to what you want (unfortunate and clumsy, but I think it will work). I think you want "from::PIG_WRAPPER::ARRAY_ELEM as from" or "FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from", something like that.
* Add a record wrapper to your schema (which may inject more messiness in the pig schema view):

{
  "namespace": "agile.data.avro",
  "name": "Email",
  "type": "record",
  "fields": [
    {"name":"message_id", "type": ["string", "null"]},
    {"name":"from","type": [{"type":"record", "name":"From", "fields": [[{"type":"array", "items":"string"},"null"]], "null"]},
    …
  ]
}

But that is very awkward, requiring a named record for each field that is an unnamed type. Ideally PigStorage would treat any union of null and one other thing as a simple pig type with no wrapper, and project the name of a field or record into the name of the thing inside a bag.

-Scott

On 3/29/12 6:05 PM, "Russell Jurney" <russell.jur...@gmail.com> wrote:

Is it possible to name string elements in the schema of an array? Specifically, below I want to name the email addresses in the from/to/cc/bcc/reply_to fields, so they don't get auto-named ARRAY_ELEM by Pig's AvroStorage. I know I can probably fix this in Java in the Pig AvroStorage UDF, but I'm hoping I can also fix it more easily in the schema.
Last time I read Avro's array docs in this context, my hit-points dropped by a third, so pardon me if I've not rtfm this time :)

Complete description of what I'm doing follows:

Avro schema for my emails:

{
  "namespace": "agile.data.avro",
  "name": "Email",
  "type": "record",
  "fields": [
    {"name":"message_id", "type": ["string", "null"]},
    {"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"to","type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"cc","type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"bcc","type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
    {"name":"subject", "type": ["string", "null"]},
    {"name":"body", "type": ["string", "null"]},
    {"name":"date", "type": ["string", "null"]}
  ]
}

Pig to publish my Avros:

grunt> emails = load '/me/tmp/emails' using AvroStorage();
grunt> describe emails
emails: {message_id: chararray,from: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},cc: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},bcc: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},reply_to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},in_reply_to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},subject: chararray,body: chararray,date: chararray}
grunt> store emails into 'mongodb://localhost/agile_data.emails' using MongoStorage();

My emails in MongoDB:

> db.emails.findOne()
{
  "_id" : ObjectId("4f738a35414e113e75707b97"),
  "message_id" : "<4f71abddc19ec_145449e389847...@li169-134.mail>",
  "from" : [ { "ARRAY_ELEM" : "da...@jobchangealerts.com" } ],
  "to" : [ { "ARRAY_ELEM" : "russell.jur...@gmail.com" } ],
  "cc" : null,
  "bcc" : null,
  "reply_to" : null,
  "in_reply_to" : null,
  "subject" : "Daily Job Change Alerts from SalesLoft",
  "body" : "Daily Job Change Alerts from SalesLoft",
  "date" : "2012-03-27T08:00:29"
}

My email on screen:

My face when I see ARRAY_ELEM, because
it means more complex presentation code: :(

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
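For the original schema, the "more complex presentation code" mentioned above might be as small as a helper that strips the ARRAY_ELEM wrappers after reading a document back from MongoDB. This is a minimal sketch under assumptions: the function name unwrap_array_elems is hypothetical, the sample document uses placeholder addresses rather than the real ones from the thread, and it only handles the one-level wrapping shown in the findOne() output.

```python
def unwrap_array_elems(doc, key="ARRAY_ELEM"):
    """Return a copy of doc where lists like [{key: v}, ...] become [v, ...].

    Only list items that are dicts containing exactly `key` are unwrapped;
    every other value (including None) is copied through unchanged.
    """
    cleaned = {}
    for field, value in doc.items():
        if isinstance(value, list):
            cleaned[field] = [
                item[key] if isinstance(item, dict) and set(item) == {key} else item
                for item in value
            ]
        else:
            cleaned[field] = value
    return cleaned


# Placeholder document shaped like the findOne() result; the addresses are made up.
email = {
    "from": [{"ARRAY_ELEM": "sender@example.com"}],
    "to": [{"ARRAY_ELEM": "recipient@example.com"}],
    "cc": None,
    "subject": "Daily Job Change Alerts from SalesLoft",
}

if __name__ == "__main__":
    print(unwrap_array_elems(email)["from"])  # prints ['sender@example.com']
```

This leaves the stored data untouched and only changes what the presentation layer sees, which is the trade-off against fixing the names in the schema itself.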