I think there's a problem with your schema. A bag holds tuples, and each element of your DataBSets array is an object with a single DataBSet key, so the DataBSet field needs to be wrapped in a tuple inside the bag.

{DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2: chararray)})}

should probably look like

{DataASet: (A1: int,A2: int,DataBSets: {(DataBSet: (B1: chararray,B2: chararray))})}

On Thu, Aug 7, 2014 at 11:22 AM, Klüber, Ralf <ralf.klue...@p3-group.com> wrote:

> Hello,
>
> I am new to this list. I tried to solve this problem for the last 48h but
> I am stuck. I hope someone here can point me in the right direction.
>
> I have problems using the Pig JsonLoader and am wondering whether I am
> doing something wrong or am running into another problem.
>
> The 1st half of this post is to show that I know at least something about
> what I am talking about and that I did my homework. During research I found
> a lot about elephant-bird, but there seems to be a conflict with Cloudera,
> so I am stuck there as well. If you trust me already you can jump directly
> to the 2nd half of my post ;-).
>
> The desired solution should work both in Cloudera and on Amazon EMR.
>
> To prove something works.
> -------------------------
>
> I have this data file:
>
> ```
> $ cat a.json
> {"DataASet":{"A1":1,"A2":4,"DataBSets":[{"B1":"1","B2":"1"},{"B1":"2","B2":"2"}]}}
> $ ./jq '.' a.json
> {
>   "DataASet": {
>     "A1": 1,
>     "A2": 4,
>     "DataBSets": [
>       {
>         "B1": "1",
>         "B2": "1"
>       },
>       {
>         "B1": "2",
>         "B2": "2"
>       }
>     ]
>   }
> }
> $
> ```
>
> I am using this Pig script to load it.
>
> ``` Pig
> a = load 'a.json' using JsonLoader('
>   DataASet: (
>     A1:int,
>     A2:int,
>     DataBSets: {
>       (
>         B1:chararray,
>         B2:chararray
>       )
>     }
>   )
> ');
> ```
>
> In grunt everything seems ok.
>
> ```
> grunt> describe a;
> a: {DataASet: (A1: int,A2: int,DataBSets: {(B1: chararray,B2: chararray)})}
> grunt> dump a;
> ((1,4,{(1,1),(2,2)}))
> grunt>
> ```
>
> So far so good.
>
> Real Problem
> ------------
>
> In fact my real data (gigabytes) looks a little bit different. The array
> is in fact an array of objects, each wrapping another object.
>
> ```
> $ ./jq '.' b.json
> {
>   "DataASet": {
>     "A1": 1,
>     "A2": 4,
>     "DataBSets": [
>       {
>         "DataBSet": {
>           "B1": "1",
>           "B2": "1"
>         }
>       },
>       {
>         "DataBSet": {
>           "B1": "2",
>           "B2": "2"
>         }
>       }
>     ]
>   }
> }
> $ cat b.json
> {"DataASet":{"A1":1,"A2":4,"DataBSets":[{"DataBSet":{"B1":"1","B2":"1"}},{"DataBSet":{"B1":"2","B2":"2"}}]}}
> $
> ```
>
> I am trying to load this JSON with the following schema:
>
> ``` Pig
> b = load 'b.json' using JsonLoader('
>   DataASet: (
>     A1:int,
>     A2:int,
>     DataBSets: {
>       DataBSet: (
>         B1:chararray,
>         B2:chararray
>       )
>     }
>   )
> ');
> ```
>
> Again it looks good so far in grunt.
>
> ```
> grunt> describe b;
> b: {DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2: chararray)})}
> ```
>
> I expect something like this when dumping b:
>
> ```
> ((1,4,{((1,1)),((2,2))}))
> ```
>
> But I get this:
>
> ```
> grunt> dump b;
> ()
> grunt>
> ```
>
> Obviously I am doing something wrong. An empty tuple suggests that the
> schema does not match the input line.
>
> Any hints? Thanks in advance.
>
> Kind regards.
> Ralf
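
For reference, here is how the schema suggested at the top of this reply might be written into the load statement for b.json. This is an untested sketch that reuses the relation name, file name, and multi-line layout from the quoted script; only the extra pair of parentheses around DataBSet is new.

```pig
-- Sketch only (not run against a cluster): suggested schema for b.json.
-- The only change from the quoted script is the tuple ( ... ) wrapped
-- around the DataBSet field inside the DataBSets bag.
b = load 'b.json' using JsonLoader('
  DataASet: (
    A1:int,
    A2:int,
    DataBSets: {
      (
        DataBSet: (
          B1:chararray,
          B2:chararray
        )
      )
    }
  )
');

-- Inspect the schema Pig derived, then the loaded data.
describe b;
dump b;
```

If the schema now matches the input, describe b should show the extra tuple level inside the bag, and dump b should no longer return an empty tuple.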